Global ETD Search

1	Implementação de uma abordagem híbrida utilizando modelagem comparativa e ab initio para predição de estruturas tridimensionais de proteínas contendo múltiplos domínios com conectores flexíveis / Implementation of a hybrid approach using comparative and ab initio modelling to predict the three-dimensional structure of proteins containing multiple domains and flexible connectors
Honorato, Rodrigo Vargas, 17 November 2015
A protein domain is an evolutionarily conserved and functionally independent amino acid sequence. One of the most important aspects of studying a protein that contains multiple domains is understanding the communication between the different domains and its biological role. This communication occurs mostly through direct interaction between domains. The interaction could be treated as a classical protein-protein interaction; however, multidomain proteins are subject to restrictions determined by their connector regions. The interdomain connectors impose restraints and limit the conformational space of the domains. We present MAD, a routine that obtains high-resolution three-dimensional models of proteins containing any number of domains from their primary sequence. Conserved domains are identified using the Conserved Domain Database (CDD), and their boundaries are used to define the connector regions. An ensemble of possible connector folds is generated, and the distribution of its C/N-terminal distances is used as a spatial restraint in the search for interactions between the domains. Models of the domains are obtained by comparative modelling. A heuristic was implemented to handle the combinatorial nature of multiple domains and the need, imposed by computational limitations, to dock the domains in pairs. All combinations of domains are submitted to the docking routines. 
Distance and energy filters are then applied: conformations whose C/N-terminal distance between domains exceeds the maximum value observed in the connector ensemble are excluded, and the energetically most favourable conformations are selected. The conformations are then subjected to hierarchical clustering based on their structural similarity. In the second phase, the selected conformations are paired with their complementary domain and resubmitted to the docking routine, until all phases have been completed. A test set of 54 multidomain proteins was built from the Protein Data Bank so that the docking routine of MAD could be compared with other software used by the scientific community; MAD proved superior or equivalent to the methods tested. The ability to use experimental data was demonstrated by proposing a model of the active form of the enzyme tyrosine phosphatase 2, which has never been observed experimentally. In parallel, the docking routine was extended into a standalone application and used to address various biological problems. 
We conclude that the methodological innovation proposed by MAD is of great value for molecular modelling and has the potential to offer a new perspective on multidomain protein interactions, since these proteins can be analysed as a whole rather than as separate domains.
Keywords: Molecular modelling; Multidomain proteins; Protein-protein interaction
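The distance and energy filtering step described in this abstract can be illustrated with a minimal Python sketch. It is not MAD's code: the `Pose` record, the `filter_poses` function, the `keep` parameter, the assumption that lower energy means more favourable, and all numbers are hypothetical and chosen only to show the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """One docked domain-pair conformation (hypothetical record, not MAD's format)."""
    name: str
    termini_distance: float  # C-terminus to N-terminus distance between the two domains
    energy: float            # docking score; lower is assumed to be more favourable


def filter_poses(poses, linker_distances, keep=2):
    """Drop poses the connector cannot span, then keep the lowest-energy ones."""
    # Largest end-to-end distance observed in the connector ensemble.
    max_distance = max(linker_distances)
    reachable = [p for p in poses if p.termini_distance <= max_distance]
    return sorted(reachable, key=lambda p: p.energy)[:keep]


# Toy data, invented for illustration only.
poses = [
    Pose("a", 12.0, -40.0),
    Pose("b", 35.0, -55.0),  # best energy, but too far apart for the connector
    Pose("c", 18.5, -48.0),
    Pose("d", 9.0, -20.0),
]
linker_distances = [8.0, 14.2, 21.7, 17.3]

selected = filter_poses(poses, linker_distances, keep=2)
print([p.name for p in selected])
```

Applying the distance filter before the energy ranking means a low-energy pose that the connector cannot physically reach (pose "b" here) never competes with the reachable ones.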
3	Conception de protéines artificielles multidomaines / Conception of multidomain artificial proteins
Léger, Corentin, 12 November 2018
The creation of new functions based on protein recognition and domain assembly is a major goal in biotechnology and a means of understanding the structure/function relationships of proteins involved in interaction processes. Today, libraries of engineered artificial proteins can be a source of proteins with recognition properties similar to those of antibody derivatives. The Protein Engineering and Modeling team has thus built a library of proteins with repeated structural motifs, called “alphaReps”. The alphaReps show remarkable properties in terms of production and stability. Unlike most antibodies and antibody derivatives, they can even be expressed in functional form in the cytoplasm of eukaryotic cells. Such objects can therefore be used as building blocks for modular engineering: new recognition functions, optimized in both specificity and affinity, can be constructed by rearranging and/or duplicating these building blocks. The first part of this thesis project consisted in constructing bidomain proteins based on alphaReps and studying their biophysical properties, in order to better understand the behaviour of such constructs. Beyond the fundamental aspect of this question, this study provides the “rules” for modulating the interactions between these proteins in a controlled way. 
The results show that new functions such as avidity, cooperativity and conformational change can be created simply by adding a linker between two alphaReps. In a second step, the goal was to develop, from the bidomain proteins previously studied, new biosensors based on FRET (Förster Resonance Energy Transfer) that can be used in vivo and in vitro. This second part presents two biosensors with detection limits in the nanomolar range. Since the alphaReps used in these constructs can be changed according to the desired target, this is a proof of concept that can be generalized to any target. Finally, the last part of this thesis focused on the design and study of new genetically encodable biosensors. These biosensors have the particular advantage of being usable immediately after production, and therefore no longer require a chemical coupling step. The results show that creating such biosensors is possible, but that optimization is still needed to improve their specificity, stability and detection capacity.
Keywords: Artificial proteins; Multidomain proteins; Protein biochemistry; Molecular biology; Biophysics
4	Estudos de macromoléculas biológicas parcialmente desestruturadas usando espalhamento de raios-X / Study of partially unstructured macromolecules using X-ray scattering
Silva, Júlio César da, 15 August 2018
Advisor: Iris Concepción Linares de Torriani / Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Física Gleb Wataghin / Previous issue date: 2010
Abstract: The traditional techniques for the structural characterization of macromolecules assume that the macromolecule has a compact, structured conformation. Flexible or disordered regions have usually been regarded as a major obstacle for techniques such as X-ray protein crystallography and nuclear magnetic resonance (NMR). The need to understand the functional activity of natively unfolded proteins and of flexible multidomain proteins has gained importance recently, not least because these proteins defy the classical structure-function paradigm, in which a protein must have a well-defined 3-D structure to be functional. In this situation, small-angle X-ray scattering (SAXS) offers unique tools for studying flexible or partially unstructured macromolecules, and it has already been applied successfully to polymers, soft matter and macromolecules in solution. In this work we took up the challenge of characterizing proteins that do not have a well-defined structure. Scattering theory, the SAXS experimental methods and the mathematical treatment of the data all needed special attention in order to be applied correctly to unstructured proteins in solution. It was thus possible to find evidence of the structural details of these proteins and to obtain a low-resolution 3-D average structure. 
Here we present the study of two proteins that belong to the group of natively unfolded proteins: (1) the FEZ1 protein, which is necessary for axon growth, and (2) the Ki-1/57 protein, which is found in diverse cancer cells, mainly in tumors of the lymphatic system. We also studied some flexible multidomain proteins: (1) two chaperones of the HSP40 class (the proteins Sis1 and Ydj1), together with two mutant constructs in which some domains were deleted; (2) the heterogeneous ribonucleoprotein hnRNP-Q, which is involved in important functions of RNA. SAXS experiments were performed, providing overall dimensional parameters and shape information about these proteins in solution. Low-resolution models of their possible conformations were calculated from the SAXS curves using ab initio modeling methods combined with rigid-body modeling. 
The results provided important structural information for elucidating the biological functions of these proteins. SAXS experiments on proteins in solution demand suitable, properly assembled instrumentation. For this reason, throughout this doctorate a substantial effort went into assembling, testing and characterizing instruments together with the technical staff of the Brazilian Synchrotron Light Laboratory (LNLS, Campinas, SP, Brazil), contributing to the commissioning of the new SAXS2 experimental station, completed in 2008.
Doctorate / Physics / Doctor of Science
Keywords: X-rays - Small-angle scattering; Multidomain proteins; Intramolecular flexibility; SAXS (small-angle X-ray scattering); Intrinsically unstructured proteins
5	Functionally Interacting Proteins : Analyses And Prediction
Mohanty, Smita, 11 1900
Functional interaction of proteins is a broad term encompassing many different types of associations that are observed amongst proteins. It includes direct non-covalent interactions, where the interacting proteins physically associate through an interface. There are also many protein-protein interactions where the proteins concerned are not involved in direct physical interactions but affect each other’s functions. The central focus of this thesis is to understand the various aspects of functionally interacting proteins. Chapter 1 provides an introduction to functional interactions between proteins and discusses the key developments available in the literature. It describes the different types of functional associations commonly observed between proteins, as well as the various approaches developed over time to elucidate such interactions. The chapter highlights how functional interactions between proteins have helped in understanding different cellular processes, such as the organization of metabolic pathways. It also emphasizes the importance of functional interactions between proteins, providing a motivation for the development of methods with enhanced accuracy and sensitivity for predicting them. In this thesis, domain families that are found to co-exist in multidomain proteins have been used to understand, and subsequently predict, functional associations amongst proteins. Domains in proteins typically serve as modules associated with specific functions. Some proteins have a single domain that describes the entire function of the protein, while others contain multiple domains that, in unison, describe the complete function of the multidomain protein. 
Therefore, by virtue of “guilt by association”, domain families found together in multidomain proteins are functionally linked. This forms the basic premise for understanding functional association amongst proteins and is explained in detail in the Introduction chapter. Using domain families that co-occur in multidomain proteins as the basis for functional association has many merits. First, as stated before, the constituent domain families act as effective descriptors of the function(s) of proteins. For example, members of the SH3 domain family mediate protein-protein interactions by binding to regions with polyproline conformation, irrespective of the multidomain protein in which they occur. Thus, studies of domain families co-existing in multidomain proteins are an accurate resource of functional associations between proteins. Also, the assignment of domains to a protein relies on homology detection, which has achieved a high level of reliability and so results in reasonably accurate prediction of functions. Such approaches enable exhaustive coverage of many diverse proteins, including many multidomain proteins, leading to the detection of large numbers of functional associations between domains of multidomain proteins. Given these advantages of functionally linked domain families for understanding functional associations, it is imperative to enumerate exhaustively all possible pairs of functionally linked domain families in multidomain proteins and to study their various properties. This aspect is covered in the second chapter of the thesis. The second chapter reports an analysis of domain families that co-occur in multidomain proteins, termed 'tethered domain families'. For this analysis, a large dataset of multidomain proteins was compiled from a diverse set of fully sequenced genomes from many eukaryotic and prokaryotic organisms. 
In every multidomain protein, all possible pairs of distinct domain families are considered, and the two families in a pair are assumed to be under the same functional/evolutionary constraint. Thus, from the entire dataset of multidomain proteins, all possible pairs of tethered domain families are obtained. For a given domain family, the number of distinct other families tethered to it is referred to as its tethering number. The tethering number of a domain family is therefore an indicator of the diverse functional contexts in which that family is involved. Further analysis was carried out to understand various other attributes of domain families and their relation to tethering number. The results are summarized in the following points: 1) The distribution of tethering numbers of domain families in the entire dataset is highly heterogeneous. About 88% of domain families (10783 out of 12249) have a tethering number of 10 or less, and only 78 domain families show more than 100 unique associations. Further analysis reveals a bias in the functions of families with high and low tethering numbers. Domain families with high tethering numbers are involved in processes such as signaling and protein-protein interactions, whereas those with low tethering numbers are often involved in metabolic processes. 2) Differences are also observed in the type of organisms containing the domain families and their tethering numbers. Typically, domain families with high tethering numbers are found ubiquitously across almost all the kingdoms of life. In contrast, most of the domain families found exclusively in one kingdom have low tethering numbers. Furthermore, for the ubiquitously occurring domain families with high tethering numbers, the number and type of associations are not strictly conserved across the kingdoms. 
Thus, the tethering preferences of such domain families vary across the kingdoms depending on their function. For instance, the protein kinase domain family, which is a key regulator of signaling processes in eukaryotes, has a high tethering number in eukaryotes (270) and a lower one in prokaryotes (96). 3) The tethering number of a domain family is correlated with the number of members (population) of the family: a Pearson correlation coefficient of 0.78 (p-value ≤ 0.001) is obtained between the tethering numbers of domain families and their populations. 4) Tethering numbers of domain families are also well correlated with sequence and functional diversity within families: domain families with high tethering numbers comprise members that are diverse in both sequence and function. The work presented in the second chapter thus provides a framework for understanding the tethering preferences of domain families. The use of tethered domain families to identify functional association amongst proteins is the central theme of the third and fourth chapters of this thesis. The use of tethered domain families for the prediction of functionally interacting proteins originates from the “Rosetta stone” approach, which was proposed by Ouzounis and coworkers and Eisenberg and coworkers in 1999. The Rosetta stone approach demonstrated the use of fused genes in predicting functional interaction. It stems from the observation that genes corresponding to proteins acting in a metabolic pathway in one organism are often found fused in another organism. Thus, enumeration of 'fused genes' in a template database could provide a good basis for predicting functionally interacting proteins in target organisms in which the homologous genes are not fused. The method has been shown, by others, to work quite effectively in prokaryotes, especially in the identification of interactions between metabolic proteins. 
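The tethering number defined above can be computed in a few lines. The following Python sketch is an illustration only: the function name `tethering_numbers`, the input layout (protein name mapped to its list of domain families) and the toy proteins are assumptions, not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def tethering_numbers(proteins):
    """Map each domain family to the number of distinct families it co-occurs with."""
    partners = defaultdict(set)
    for families in proteins.values():
        # Every pair of distinct families within one protein counts as tethered;
        # repeated copies of the same family in a protein are collapsed first.
        for a, b in combinations(sorted(set(families)), 2):
            partners[a].add(b)
            partners[b].add(a)
    return {family: len(others) for family, others in partners.items()}


# Toy multidomain proteins, invented for illustration only.
proteins = {
    "P1": ["SH3", "SH2", "Pkinase"],
    "P2": ["SH3", "PH"],
    "P3": ["Pkinase", "PH"],
    "P4": ["SH3", "SH3", "SH2"],
}

numbers = tethering_numbers(proteins)
print(numbers)
```

Storing partners in sets means a family pair seen in many proteins still contributes only once, which matches the definition as a count of unique associations.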
Chapter 3 of this thesis explores the idea of “Rosetta stones” at the level of domain families, by considering tethered domain families as analogs to the fused genes. In this analysis, tethered domain families derived from multidomain proteins comprise the template dataset. If members of two domain families occurring in a multidomain protein are found to occur independently in two different proteins in the target organism, then an interaction is predicted between these two proteins (the collection of such predicted interactions is henceforth referred to as the TEDIP database, Tethered Domain-based Interaction Prediction). During this analysis, care is taken such that none of the proteins in the template dataset belongs to the target organisms. The entire analysis has been conducted on 6 model organisms which act as the target dataset where functional interactions between proteins are predicted. The effectiveness of tethered domain families in functional interaction prediction is compared with two other datasets: 1) all experimentally known interactions and 2) interactions predicted on the basis of their homology with interacting domain families with known structure. Subsequently, an attempt has been made to answer these questions: 1) how effective is the information on tethered domain families in predicting functional linkages amongst proteins operating in pathways in eukaryotic organisms? 2) what is the false positive rate of the predictions? The above-mentioned datasets show very little overlap in the coverage of functional interactions. This is largely attributed to insufficient sampling and inherent bias existing in each of the methods. The TEDIP datasets in the six organisms led to an average three-fold more functional interaction predictions in cellular pathways than the other two datasets. Nearly 90% of the predicted interactions derived from tethered domain families are amongst proteins across different pathways.
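The prediction rule described above can be sketched as follows. This is only an illustration under stated assumptions: the data structures, the function names and the reading of "occur independently" as "neither target protein carries both families itself" are assumptions, not the thesis's actual TEDIP pipeline. Removal of target-organism proteins from the template set is assumed to have happened upstream.

```python
from itertools import combinations


def tethered_pairs(template_architectures):
    """Collect unordered family pairs that co-occur in a template protein."""
    pairs = set()
    for families in template_architectures.values():
        for a, b in combinations(sorted(set(families)), 2):
            pairs.add((a, b))
    return pairs


def is_split(a, b, fams_p, fams_q):
    """True if families a and b sit on two different proteins, one each.

    Assumption: a protein that already carries both families does not
    count as an independent occurrence.
    """
    if (a in fams_p and b in fams_p) or (a in fams_q and b in fams_q):
        return False
    return (a in fams_p and b in fams_q) or (b in fams_p and a in fams_q)


def predict_interactions(template_architectures, target_proteins):
    """target_proteins: hypothetical mapping protein id -> set of families."""
    pairs = tethered_pairs(template_architectures)
    predictions = set()
    for p, q in combinations(sorted(target_proteins), 2):
        if any(is_split(a, b, target_proteins[p], target_proteins[q])
               for a, b in pairs):
            predictions.add((p, q))
    return predictions


# Toy data, invented for illustration.
templates = {"T1": ["A", "B"], "T2": ["C", "D", "A"]}
targets = {"x1": {"A"}, "x2": {"B"}, "x3": {"E"}, "x4": {"C", "D"}}
predicted = predict_interactions(templates, targets)
```

Here `x1` is linked to `x2` through the tethered pair (A, B) and to `x4` through (A, C) and (A, D), while `x3` receives no prediction because family E is never tethered in the templates.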
In yeast, more than 60% of such interactions were found to be overlapping with a recent large scale genetic interaction screen based on synthetic lethality especially performed for metabolic proteins, thus establishing the effectiveness of this approach in understanding pathway crosstalk. Along with efficacy in identifying functional interactions, an assessment based on co-localization, co-expression and overall functional similarity based on Gene Ontology (GO) terms was carried out. It was found that the TEDIP predictions and experimentally found interactions show poor correspondence with co-expression and co-localization data (10% and 20% respectively for the two methods). Additionally, it was found that functional similarity between predicted interacting proteins in the TEDIP dataset is low (5%) and is comparable to that of experimentally known interactions, which show 10% similarity in functions based on a scoring function for GO term similarity. From Chapter 3, it was concluded that the use of tethered domain families is effective in exhaustive enumeration of functionally associated proteins. However, the low co-expression and functional similarity measures are a cause for concern. On the one hand, co-expression and GO functional similarity have been found to be weak predictors of functional interactions, explaining the low values obtained for both predictions in the TEDIP datasets and experimentally known interactions. On the other hand, the poorer values shown for predictions in the TEDIP datasets suggest that further improvement in prediction accuracy is possible. Chapter 4 explores the use of machine learning in improving the accuracy of functional interaction prediction based on the TEDIP dataset. In Chapter 4, two distinct machine learning approaches have been employed on a training dataset derived exclusively from yeast.
Since the objective of the work is to improve the accuracy of prediction of functional interactions, the GO based functional similarity measures have been used to define positive and negative datasets. Thus, in the training dataset, the positive dataset comprises protein pairs which show high GO similarity in functions as defined in Chapter 3, and 10% of this data overlaps with experimentally known interactions, while the negative dataset consists of protein pairs with no or insignificant similarity in their functions that additionally do not show similarity to any experimentally known interactions. Two machine learning approaches, namely Support vector machine (SVM) and Random forest, have been used on this training dataset. Use of two distinct approaches helps in addressing the weakness, if any, of these methods. Fourteen carefully chosen features have been utilized during the training process to aid in distinguishing potentially correctly predicted interactions from incorrect predictions. Some of the 14 features quantify the extent of similarity between the template proteins containing the fused domain families and the target protein pairs predicted to interact. The analysis also incorporates graph theory based parameters which are derived from a domain family based graph. In such a graph, each domain family involved in forming multidomain proteins represents a node, and an edge is constructed between domain families which are found to co-exist in at least one multidomain protein. Graph theory based parameters such as clustering coefficient, degree and topological overlap have been employed. These are useful in appropriately down-weighting the domain family pairs showing a large number of associations, which are expected to be promiscuous in their functions. These features also help in identifying domain family pairs which are functionally related.
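The domain family graph and the three graph parameters named above can be sketched in a few lines. The abstract does not give the exact formulas used in the thesis, so this sketch assumes the standard local clustering coefficient and one commonly used form of topological overlap, (shared neighbours + a_ij) / (min(k_i, k_j) + 1 - a_ij), where a_ij is 1 if the two families are directly linked. All names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(architectures):
    """Nodes are domain families; an edge joins families sharing a protein."""
    adj = defaultdict(set)
    for families in architectures.values():
        for a, b in combinations(sorted(set(families)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def degree(adj, node):
    return len(adj.get(node, set()))


def clustering_coefficient(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves linked."""
    nbrs = adj.get(node, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def topological_overlap(adj, a, b):
    """Assumed definition; the thesis may use a different variant."""
    nbrs_a, nbrs_b = adj.get(a, set()), adj.get(b, set())
    direct = 1 if b in nbrs_a else 0
    shared = len(nbrs_a & nbrs_b)
    return (shared + direct) / (min(len(nbrs_a), len(nbrs_b)) + 1 - direct)


# Toy data, invented for illustration.
graph = build_graph({
    "P1": ["Pkinase", "SH2", "SH3"],
    "P2": ["Pkinase", "PH"],
    "P3": ["SH3", "SH3", "PH"],
})
cc = clustering_coefficient(graph, "Pkinase")
to_linked = topological_overlap(graph, "Pkinase", "SH3")
to_unlinked = topological_overlap(graph, "SH2", "PH")
```

A highly connected family with a low clustering coefficient behaves like a promiscuous hub, which is the kind of family pair the abstract says these parameters are meant to down-weight.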
Apart from the above-mentioned features, coevolution and phylogenetic profiling of tethered domain families are also utilized to identify functionally related domain family pairs. Utilizing all these features in training, the machine learning approach yielded an accuracy of 94% using SVM and 92% using Random forest against the training data. Furthermore, the importance of using all these features has been addressed by performing principal component analysis, by training both SVM and Random forest with one feature removed at a time, and by quantifying the sensitivity obtained using only one feature. All of these suggest that the features used provide non-redundant information and contribute significantly to the classification. The models so generated were finally used on all the predicted functional interactions after the removal of the training dataset in yeast. The true positives observed were 56% using SVM and 63% using Random forest, with around 80% of the interactions common between the two methods. Further analysis has been carried out on these interactions by first imparting a confidence score to these interactions using support vector regression that provides a probabilistic measure for SVM classification. Based on a cutoff of 0.5, 62455 interactions in total were termed as high confidence interactions. Further analysis was carried out for the high confidence interactions. Out of these, in 2855 interactions, both the proteins predicted to interact could be associated with a pathway in the KEGG database. In-depth case studies have been performed on this dataset of 2855 interactions. Literature mining suggested that many known cross-pathway interactions, such as those between TCA and glycolysis, are captured as high confidence interactions using the TEDIP dataset. A few other case studies of high confidence interactions with supporting literature evidence are also presented in the chapter.
These predictions could further aid in experimental characterization of pathway cross-talk between important metabolic and signaling pathways. So far, the thesis discussed analyses involving functional interactions and their prediction. In the subsequent chapters, analyses pertaining to two different types of functional interactions are discussed. Chapters 5 and 6 involve analyses incorporating metabolic proteins in diverse pathways in the pathogenic organism Plasmodium falciparum. Chapter 5 attempts to improve the coverage of the repertoire of metabolic proteins in P.falciparum while in Chapter 6 interactions and pathways prevalent in different stages in the life cycle of the parasite are deciphered and discussed. Apart from functionally interacting proteins in metabolic pathways, physically and transiently interacting proteins have been analyzed and discussed in Chapters 7 and 8. In Chapter 5, metabolic proteins participating in pathways in Plasmodium falciparum have been analyzed. P.falciparum is the causative agent of malaria, a disease which affects large populations in the subtropical regions. P.falciparum genome is atypical and is rich in Adenine/Thymine pairs, and there is presence of large stretches of amino acid repeats encoded in protein coding regions. Various sequence-related features of P.falciparum proteins when compared with those of other organisms show extensive divergence. All of these have made reliable function prediction, by homology to other proteins with known functions, daunting. Like other proteins in P.falciparum, metabolic proteins have also diverged significantly from their functional counterparts in model eukaryotes such as yeast. Metabolic pathways play an important role in the survival of the organism and hence are amenable towards the identification of proteins susceptible to drugs, thereby combating pathogenesis. 
Chapter 5 of the thesis aims at furthering knowledge pertaining to metabolic proteins by first quantifying the extent of divergence observed in the already characterized metabolic proteins. This knowledge is further used in identification of potential metabolic proteins which are not identified as proteins involved in metabolic pathways by other annotation efforts undertaken for P.falciparum. In the first part of the chapter, the extent of divergence in the sequences of metabolic proteins in P.falciparum has been determined by comparing the P.falciparum proteins with their functional counterparts from 34 completely sequenced unicellular eukaryotic organisms. Comparison of domain architectures between the P.falciparum proteins and their functional counterparts reveals that in nearly 54% of metabolic pathways, proteins show nearly the same domain architecture as the other functional counterparts. Inversion, deletion and duplication of domains are observed in the rest of the proteins. Further analysis reveals that P.falciparum proteins are longer than their functional counterparts. It was also observed that in nearly 15% of the cases, the domains are characterized by the presence of large non-conserved or Plasmodium genus-specific inserts within the domain assigned regions. There is also prevalence of unassigned regions in the N- and C-terminal regions in P.falciparum proteins when compared with their functional counterparts. Finally, it was also observed that metabolic proteins of P.falciparum show significantly low sequence similarity when compared with other functional counterparts. From this analysis, it can be clearly seen that metabolic proteins of P.falciparum have significantly diverged from such proteins in other organisms, thus making function prediction by homology very difficult. There are several steps in metabolic pathways in P.falciparum which are expected to be active based on experimental analysis.
However, some of these proteins with expected functions have not been identified so far. One of the reasons for this apparent incompleteness is the high divergence observed in the metabolic proteins of P. falciparum. To overcome this limitation, in the second part of the chapter, a sensitive approach based on domain family assignment (MulPSSM), developed in-house, has been used to identify proteins which are potentially involved in metabolic pathways. The approach is based on reverse PSI–BLAST, where multiple sequence profiles for each family are used to search against sequence databases. This approach has been shown to be better or at-par with other remote homology detection procedures. Using this approach, 15 P. falciparum proteins have been identified which can potentially function as metabolic proteins and were not characterized in P.falciparum so far. All the proteins identified by the approach show low sequence similarity to other well characterized proteins and contain significant fractions of unassigned regions thus, making function recognition non-trivial. Supporting literature and other data is provided to demonstrate the robustness of the homology-based annotation of the identified pathway proteins. Chapter 6 is an analysis of the dynamic changes occurring in the metabolic network of P.falciparum during its life cycle. In this chapter, two aspects of P. falciparum metabolic proteins have been integrated and analyzed. First, the dataset of protein-protein interactions derived from experimental studies and second, the datasets of microarray analysis providing information on stage specific expression of P. falciparum genes corresponding to the metabolic proteins. As a first step, protein-protein interaction information for the metabolic proteins was gathered. A total of 810 interactions have been obtained, where one or both proteins are involved in a pathway. 
Subsequently, these interactions were compared with 14070 interactions involving metabolic proteins from the free-living, non-pathogenic unicellular eukaryote yeast. Comparison across the two organisms shows wide discrepancy in the number of proteins involved in interactions and also the pathways in which they participate. Out of the 810 interactions in P.falciparum, 173 are found uniquely in Plasmodium, where one or both of the proteins have no identifiable homolog in yeast. Insufficient sampling of interactions made by proteins in P.falciparum in comparison to yeast is one of the reasons for the observed discrepancy. However, the differences due to the parasitic lifestyle of P.falciparum could also be a potential reason. Further analysis of the protein-protein interactions made by the metabolic proteins revealed that a large fraction of interactions are made between a metabolic protein and a non-metabolic protein. For instance, an interaction is observed between the glycolytic protein phosphoglycerate kinase and MAP kinase. This trend is observed in both Plasmodium and yeast, where 65% and 77% of the interactions, respectively, involve proteins not directly participating in metabolic pathways. Further, interactions between proteins belonging to different pathways and, lastly, interactions between proteins in the same pathway are uncovered. All of these interactions depict the different modes by which metabolic pathways are regulated through protein-protein interactions. Another aspect explored in this analysis is the stage specific expression of genes encoding these metabolic proteins. The analysis is especially relevant in the parasite because its entire life cycle is divided into seven distinct stages. Upon integrating the protein-protein interactions with the gene expression data, it became apparent that the trophozoite, schizont and gametocyte stages show large fractions of co-expressed genes encoding proteins involved in protein-protein interactions within metabolic pathways.
The high preponderance of co-expressed genes encoding interacting protein pairs in these stages is also consistent with the metabolic requirements of Plasmodium in the various stages. The glycolytic pathway is central to energy production in the parasite and is discussed at length in this chapter. Members of this pathway are involved in interactions with other glycolytic proteins (9 such interactions); they also interact with proteins involved in other pathways (30 interactions) and with proteins not involved directly in any metabolic pathway (75 interactions). Nearly 70% of the interactions made by the glycolytic proteins involve proteins encoded by genes found to be co-expressed across the various stages. Integration of gene expression data along with protein-protein interaction information for metabolic pathways such as the glycolytic pathway thus highlights the complex mode of regulation underlying this pathway. The analysis carried out in this chapter emphasizes the intricacies involved in the regulation of metabolic proteins in P.falciparum. Chapter 7 describes an in-depth analysis carried out to understand the basis for interaction specificity between small monomeric GTPases and their regulators, the Guanine nucleotide Exchange Factors (GEFs). Monomeric GTPases are involved in binding to guanine nucleotide. These proteins can bind to both GTP and GDP. However, transition from the GDP bound to the GTP bound form occurs with large conformational changes and requires binding of the GEFs. The conformational changes that arise due to the nucleotide exchange are required for the GTPases to bind to their various effectors. For the analysis carried out in Chapter 7, GTPases belonging to the Ras superfamily have been considered. The superfamily is further subdivided into 5 distinct families based on their functions. The 5 families are Ras, Ran, Rab, Arf and Rho.
Members belonging to each of these families are involved in a wide array of cellular processes such as signaling and cytoskeletal remodeling. Members of each of these GTPase families bind to structurally distinct GEFs, and in some cases, multiple GEFs are involved in nucleotide exchange within a family. It is therefore intriguing to understand how GTPases belonging to the same structural family maintain specificity across the highly dissimilar GEFs, and this forms the main objective of this analysis. So far, 13 distinct complexes between GTPases and their cognate GEFs have been solved using X-ray crystallography. This set of structural complexes forms the starting point of the analysis. As a first step, pairwise structural comparison of the interfaces has been made between various pairs of complex structures. Based on these comparisons, it is apparent that most of the interfaces in the GTPase and GEF complexes comprise residue positions which are topologically not equivalent, suggesting different modes of binding across these complexes. Further analysis was carried out to probe the extent of specificity underlying these complexes. This is achieved by determining interface residues which are found to be conserved in a family specific manner. Such residue positions have been obtained by using a statistically robust algorithm, Contrast Hierarchical Alignment and Interaction Network (CHAIN), that extracts sequence patterns most distinguishing two sets of homologous sequences. The analysis indicated the presence of family specific residues at the GTPase and GEF interface. Such residues could be implicated in maintaining the specific interactions between the GTPases and the GEFs. The robustness in the specificity of the interactions was further interrogated by providing an energetic basis to the specificity in the interactions mediated by the cognate GTPases and the GEFs and also by understanding how crosstalk is prevented across the non-cognate complexes.
For each of the 13 cognate complexes, empirical interaction energies have been estimated using FoldX. The interaction energy is compared to non-cognate complexes which are obtained by swapping the interface residues of the cognate GTPase with the non-cognate GTPase residues. For most of the complexes, it was observed that the interaction energies for the cognate complexes are much lower than the non-cognate complexes. Energy values across the non-cognate complexes are usually indicative of reduced stability, thereby precluding such interactions from occurring. Such large energy differences between cognate and non-cognate interactions arise due to drastic substitutions at the interface patch due to difference in the charge or other stereochemical aspects of the amino acids. Both evolutionary and energy based analysis indicates the presence and importance of few family specific residues in the cognate complexes and also the presence of unfavorable residues in the non-cognate complexes thus preventing crosstalk. However, apart from changes at the interfaces, many positions outside the interface also undergo changes across the various homologs within the same family/subfamily of GTPase. Coevolutionary analysis of GTPase and GEFs from multiple eukaryotic organisms has been carried out in these complexes and it was observed that most of the coevolving positions are not found at the interface. Many of these residue positions are near the active site or near the interface. Identification of such coevolving positions, where residue variations in the GTPase are strongly coupled to the GEF, may provide initial clues to the possible allosteric path adopted in connecting the binding of GEF to the vast structural changes observed during GTP exchange in GTPases. Thus, the analysis provides a comprehensive framework to understand how interaction specificity has evolved between the GTPase and GEF complexes. 
Chapter 8 discusses another example of transient protein-protein interaction observed between proteins implicated in signaling process in Dictyostelium discoideum. The work reported in this chapter was carried out in collaboration with Prof. Nanjundaiah and coworkers from Molecular Reproduction and Developmental Genetics department, Indian Institute of Science. All the experimental analyses mentioned in this chapter were carried out by Prof. Nanjundaiah and coworkers and the author carried out all the computational analysis. Experimental analysis indicated the presence of a ribosomal protein S4 in D. discoideum which mediates interactions with CDC24 and CDC42. The protein is speculated to be a functional analog of yeast scaffolding protein Bem1. However, the exact structural and sequence features of the protein which can accommodate its non-ribosomal function as a scaffold by mediating protein-protein interactions are not clearly understood. With the aid of structural modeling, a 3-D structure was generated for the C-terminal regions of D. discoideum protein S4. The modeled structure, as in the template used for modelling, resembled the fold of SH3 domain which has been shown to be involved in protein-protein interactions. Structural and sequence analyses were carried out to evaluate the potential mode by which interactions could be mediated by this protein. The hypothesis generated was further corroborated by experimental analysis. Thus, both experimental and computational analysis provide evidence for the functional role of the ribosomal protein S4 from Dictyostelium discoideum as a scaffold. Chapter 9 summarizes the conclusions reached in various chapters of the thesis. The thesis embodies analyses probing various aspects of functional interactions between proteins. A frame work has been provided to elucidate functional interactions using tethered domain families in multidomain proteins. 
Further, the role of these functional interactions has been explored in different scenarios by exhaustively analyzing metabolic proteins and their regulation in the pathogenic organism Plasmodium falciparum and by also analyzing two distinct types of transient protein-protein interactions. Proteins - Functional Interactions Multidomain Proteins Functionally Interacting Proteins Tethered Domain Families Protein-Protein Interactions Protein Analysis Metabolic Proteins Ribosomal Protein Biochemistry
6	An investigation into dynamic and functional properties of prokaryotic signalling networks Kothamachu, Varun Bhaskar January 2016 In this thesis, I investigate dynamic and computational properties of prokaryotic signalling architectures commonly known as the Two Component Signalling networks and phosphorelays. The aim of this study is to understand the information processing capabilities of different prokaryotic signalling architectures by examining the dynamics they exhibit. I present original investigations into the dynamics of different phosphorelay architectures and identify network architectures, including a commonly found four step phosphorelay architecture, with a capacity for tuning their steady state output to implement different signal-response behaviours, viz. sigmoidal and hyperbolic response. Biologically, this tuning can be implemented through physiological processes like regulating total protein concentrations (e.g. via transcriptional regulation or feedback), altering reaction rate constants through binding of auxiliary proteins on relay components, or by regulating bifunctional activity in relays which are mediated by bifunctional histidine kinases. This study explores the importance of different biochemical arrangements of signalling networks and their corresponding response dynamics. Following investigations into the significance of various biochemical reactions and topological variants of a four step relay architecture, I explore the effects of having different types of proteins in signalling networks. I show how multi-domain proteins in a phosphorelay architecture with multiple phosphotransfer steps occurring on the same protein can exhibit multistability through a combination of double negative and positive feedback loops.
I derive a minimal multistable (core) architecture and show how component sharing amongst networks containing this multistable core can implement computational logic (like AND, OR and ADDER functions) that allows cells to integrate multiple inputs and compute an appropriate response. I examine the genomic distribution of single and multi domain kinases and annotate their partner response regulator proteins across prokaryotic genomes to find the biological significance of dynamics that these networks embed and the processes they regulate in a cell. I extract data from a prokaryotic two component protein database and take a sequence based functional annotation approach to identify the process, function and localisation of different response regulators as signalling partners in these networks. In summary, work presented in this thesis explores the dynamic and computational properties of different prokaryotic signalling networks and uses them to draw an insight into the biological significance of multidomain sensor kinases in living cells. The thesis concludes with a discussion on how this understanding of the dynamic and computational properties of prokaryotic signalling networks can be used to design synthetic circuits involving different proteins comprising two component and phosphorelay architectures. 571.7
