1 |
Development and Validation of a Structure-Based Computational Method for the Prediction of Protein Specificity Profiles
Gagnon, Olivier, 23 September 2019
Post-translational modification (PTM) of proteins by enzymes such as methyltransferases, kinases and deacetylases plays a crucial role in the regulation of many metabolic pathways. Determining the substrate scope of these enzymes is essential when studying their biological role. However, the combinatorial nature of possible protein substrate sequences makes experimental screening assays intractable. Various computational approaches have been developed to predict new substrates for proteins. Our method relies on crystallographic data and a novel multistate computational protein design algorithm. We previously used our method to successfully predict four new substrates for SMYD2 (Lanouette S & Davey J.A., 2015), doubling the number of known targets for this PTM enzyme, which has been difficult to characterize using other methods. This was possible by first extracting a specificity profile of Smyd2 using our algorithm and subsequently screening a peptide library for matching sequences. However, our method did not yield successful results when attempting to reproduce specificity profiles of other proteins (64% accuracy on average). Different protein environments exposed limitations in the methodology and led us to further develop the algorithm on a more thorough dataset. Using our new optimized method, the accuracy of specificity profile predictions increases by roughly 20 percentage points (84% accuracy on average), independent of the structural template used. The algorithm was then used to blindly predict a specificity profile for the methyltransferase Smyd3, an enzyme for which limited data is currently available. A library of 2550 peptides was screened with the predicted profile, yielding 123 matching sequences. We randomly chose 64 for experimental validation (SPOT peptide array) of methylation by Smyd3 and found 45 methylated and 19 non-methylated peptides (70% success rate).
Finally, we released to the community a web version of the algorithm, which can be accessed at http://viper.science.uottawa.ca.
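The screening step described in this abstract (matching library peptides against a predicted specificity profile) can be illustrated with a minimal sketch. It assumes the profile is reduced to a set of allowed residues per peptide position; that representation, the function names, and the toy profile and library are hypothetical and are not the Smyd3 profile or the author's algorithm. The final line only reproduces the arithmetic behind the reported success rate.

```python
from typing import Dict, List, Set

# Hypothetical representation: 0-based peptide position -> allowed amino acids.
Profile = Dict[int, Set[str]]

def matches(peptide: str, profile: Profile) -> bool:
    """Return True if every constrained position holds an allowed residue."""
    return all(
        pos < len(peptide) and peptide[pos] in allowed
        for pos, allowed in profile.items()
    )

def screen(library: List[str], profile: Profile) -> List[str]:
    """Keep the peptides of the library that satisfy the profile."""
    return [peptide for peptide in library if matches(peptide, profile)]

# Toy data for illustration only (not taken from the thesis).
toy_profile: Profile = {1: {"K"}, 2: {"S", "T"}, 3: {"K", "R"}}
toy_library = ["AKSKA", "AKTRG", "AKAKA", "GRSKL"]
hits = screen(toy_library, toy_profile)

# Success rate reported in the abstract: 45 methylated out of 64 tested peptides.
success_rate = 45 / 64  # 0.703125, i.e. about 70%
```

A real profile would more likely carry per-residue scores or energies than a yes/no set, in which case `matches` would compare a summed score against a threshold.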
|
2 |
Protein Engineering for Biosensor Development
Miklos, Aleksandr, 24 November 2008
Biosensors incorporating proteins as molecular recognition elements for analytes are used in clinical diagnostics, as biological research tools, and to detect chemical threats and pollutants. This work describes the application of protein engineering techniques to address three aspects in the design of protein-based biosensors: the transduction of binding into an observable, the manipulation of affinities, and the diversification of specificities. The periplasmic glucose-binding protein from the hyperthermophile Thermotoga maritima (tmGBP) was fused with green fluorescent protein variants to construct a fluorescent ratiometric sensor that is sufficiently robust to detect glucose up to 67°C. Ligand-binding affinities of tmGBP were changed by altering a C-terminal helical domain that tunes ligand-binding affinity through conformational coupling effects. This method was extended to the Escherichia coli arabinose-binding protein. Computational design techniques were used to diversify the specificity of the E. coli maltose-binding protein (ecMBP) to bind ibuprofen, a non-steroidal anti-inflammatory drug. These designs ranged in affinity from 0.24 to 0.8 mM and function as reagentless fluorescent sensors. The ligand affinities of ecMBP are tuned by complex interactions that control conformational coupling. These experiments demonstrate that long-range conformational effects as well as molecular recognition interactions need to be considered in the design of high-affinity receptors. / Dissertation
|
3 |
Engineering of Multi-Substrate Enzyme Specificity and Conformational Equilibrium Using Multistate Computational Protein Design
St-Jacques, Antony D., 19 December 2018
The creation of enzymes displaying desired substrate specificity is an important objective of enzyme engineering. To help achieve this goal, computational protein design (CPD) can be used to identify sequences that can fulfill interactions required to productively bind a desired substrate. Standard CPD protocols find optimal sequences in the context of a single state, for example an enzyme structure with a single substrate bound at its active site. However, many enzymes catalyze reactions requiring them to bind multiple substrates during successive steps of the catalytic cycle. The design of multi-substrate enzyme specificity requires the ability to evaluate sequences in the context of multiple substrate-bound states because mutations designed to enhance activity for one substrate may be detrimental to the binding of a second substrate. Additionally, many enzymes undergo conformational changes throughout their catalytic cycle and the equilibrium between these conformations can have an impact on their substrate specificity. In this thesis, I present the development and implementation of two multistate computational protein design methodologies for the redesign of multi-substrate enzyme specificity and the modulation of enzyme conformational equilibrium. Overall, our approaches open the door to the design of multi-substrate enzymes displaying tailored specificity for any biocatalytic application.
|
4 |
Multistate Computational Protein Design: Theories, Methods, and Applications
Davey, James A., January 2016
Traditional computational protein design (CPD) calculations model sequence perturbations and evaluate their stabilities using a single fixed protein backbone template in an approach referred to as single-state design (SSD). However, certain design objectives require the explicit consideration of multiple conformational states. Cases where a multistate framework may be advantageous over the single-state approach include the computer-aided discovery of new enzyme substrates, the prediction of protein stabilities, and the design of protein dynamics. These design objectives can be tackled using multistate design (MSD). However, it is often the case that a design objective requires the consideration of a protein state having no available structure information. For such circumstances the multistate framework cannot be applied. In this thesis I present the development of two template and ensemble preparation methodologies and their application to three projects, the purpose of which is to demonstrate the ensemble modeling strategies needed to overcome limitations in available structure information. Particular emphasis is placed on the ability to recapitulate experimental data to guide modelling of the design space. Specifically, the use of MSD allowed for the accurate prediction of a methyltransferase recognition motif and new substrates, the prediction of mutant sequence stabilities with quantitative accuracy, and the design of dynamics into the rigid Gβ1 scaffold, producing a set of dynamic variants whose tryptophan residue exchanges between two conformations on the millisecond timescale. Implementing both generation protocols developed here, the ensemble protocol (coordinate perturbation followed by energy minimization, PertMin) and the template protocol (rotamer optimization followed by energy minimization, ROM), allows exploration and manipulation of the structure space, enabling the success of these applications.
|
5 |
Computational approaches toward protein design / Approches computationnelles pour le design de protéines
Traore, Seydou, 23 October 2014
Computational Protein Design (CPD) is a very young research field which aims at providing predictive tools to complement protein engineering. Indeed, in addition to the theoretical understanding of fundamental properties and function of proteins, protein engineering has important applications in a broad range of fields, including biomedical applications, biotechnology, nanobiotechnology and the design of green reagents. CPD seeks to accelerate the design of proteins with desired properties by enabling the exploration of larger sequence spaces while limiting the financial and human costs at the experimental level. To succeed in this endeavor, CPD requires three ingredients to be appropriately conceived: 1) a realistic modeling of the design system; 2) an accurate definition of objective functions for the target biochemical function or physico-chemical property; and 3) an efficient optimization framework to handle large combinatorial sizes. In this thesis, we addressed CPD problems with a special focus on combinatorial optimization. In a first series of studies, we applied for the first time the Cost Function Network optimization framework to solve CPD problems and found that, in comparison to other existing methods, it brings a speedup of several orders of magnitude on a wide range of real CPD instances that include the stability design of proteins, protein-protein complexes and protein-ligand complexes. A tailored criterion to define the mutation space of residues was also introduced in order to constrain output sequences to those expected by natural evolution through the integration of some structural properties of amino acids in the protein environment. The developed methods were finally integrated into a CPD-dedicated software package in order to make them more accessible to the scientific community.
|
6 |
Investigating Different Rational Design Approaches to Increase Brightness in Red Fluorescent Proteins
Legault, Sandrine, 27 September 2021
Red fluorescent proteins (RFPs) are used extensively in biological research because their longer emission wavelengths are less phototoxic and allow deeper imaging of animal tissue. However, far-red RFPs generally display low brightness, emphasizing the need to develop brighter variants. Here, we investigate three approaches to rigidify the RFP chromophore to increase the quantum yield, and thereby brightness. We first used computational protein design on a maturation-efficient mRojo-VHSV variant previously engineered in our lab to introduce a Superdecker motif, a parallel pi-stack comprising aromatic residue side chains and the phenolate moiety of the chromophore, which we hypothesized would enhance chromophore packing and reduce non-radiative decay. The best mutants identified showed up to 1.7-fold higher quantum yield at pH 9, relative to their parent protein. We next postulated that brightness could be further increased by rigidifying the chromophore via branched aliphatic residues. Computational protein design was performed on a dim mCherry variant, mRojoA, followed by directed evolution on the brightest mutant. The combination of these methodologies yielded mSandy2, the brightest Discosoma-derived monomeric RFP with an emission maximum above 600 nm. Finally, we aimed to increase brightness by focusing on positions where residue rigidity correlated to quantum yield in mCherry-related RFPs according to NMR data that had been previously acquired in our lab. Combinatorial site-saturation mutagenesis was performed on two different surface patches of mCherry at positions 144/145/198 and 194/196/220. Our results demonstrated that surface residues may not be adequate targets for this approach. Altogether, the work herein presents unique rational design methodologies that can be used to increase brightness in RFPs.
|
7 |
Défis algorithmiques pour les simulations biomoléculaires et la conception de protéines / Algorithmic challenges for biomolecular simulations and protein design
Druart, Karen, 05 December 2016
Computational protein design is a method to modify proteins and obtain new properties, using their 3D structure and molecular modelling. To make the method more predictive, the models need continued improvement. In this thesis, we addressed the problem of explicitly representing the flexibility of the protein backbone. We developed a "multi-state" design approach, based on a small discrete library of backbone conformations, defined ahead of time. In a Monte Carlo framework, given the rugged protein energy landscape, large backbone motions can only be accepted if precautions are taken. Thus, to explore these conformations, along with sidechain mutations and motions, we have introduced a new type of move into an existing Monte Carlo method. The move is a "hybrid" one, where the backbone changes its conformation, then a short Monte Carlo relaxation of the sidechains alone is done, followed by an acceptance test. To obtain a Boltzmann sampling of states, the acceptance probability should have a specific form, which involves a path integral that is difficult to calculate in practice. Two approximate forms are explored: the first is based on a single relaxation path, or "generating path" (Single Path Approximation or SPA). The second is more complex and relies on a collection of paths, obtained by shuffling the elementary steps of the generating path (Permuted Path Approximation or PPA). These approximations are tested in depth and compared on two proteins. Free energy differences between the backbone conformations are computed using three different approaches, which move the system reversibly from one conformation to another, but follow very different routes. Good agreement is obtained between the methods over a wide range of parameterizations, indicating that the free energy behaves as a state function, as it should, and strongly suggesting that the states are sampled according to the Boltzmann distribution. The sampling method is applied to a loop in the active site of the tyrosyl-tRNA synthetase enzyme, allowing us to identify sequences that prefer either an open or a closed conformation of the loop, so that in principle we can control or redesign the loop conformation. Finally, we describe preliminary work to make the protein backbone fully flexible, moving within a continuous and not a discrete space. This new conformational space requires a complete reorganization of the energy calculation and Monte Carlo simulation scheme, increases simulation cost substantially, and requires a much more aggressive parallelization of our software.
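For readers unfamiliar with the acceptance test that closes each Monte Carlo move, the following is a minimal sketch assuming the standard Metropolis criterion. This is a textbook simplification swapped in for illustration: it is not the SPA or PPA acceptance probability developed in the thesis, and the function names, the two-state toy system, the energies and the kT value are all hypothetical.

```python
import math
import random

def metropolis_accept(delta_e: float, kt: float, rng: random.Random) -> bool:
    """Standard Metropolis test: always accept downhill moves,
    otherwise accept with probability exp(-delta_e / kt)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kt)

def sample_two_state(energies, kt, n_steps, rng):
    """Toy chain over two hypothetical backbone states.
    Returns the fraction of steps spent in state 1."""
    state, visits_to_1 = 0, 0
    for _ in range(n_steps):
        trial = 1 - state  # symmetric proposal: jump to the other state
        if metropolis_accept(energies[trial] - energies[state], kt, rng):
            state = trial
        visits_to_1 += state
    return visits_to_1 / n_steps

rng = random.Random(0)
occupancy = sample_two_state(energies=(0.0, 1.0), kt=1.0, n_steps=20000, rng=rng)
# Boltzmann population of state 1 that the chain should approach.
expected = math.exp(-1.0) / (1.0 + math.exp(-1.0))
```

With a symmetric proposal, this criterion samples states with Boltzmann weights, which is the property the abstract checks for its hybrid move by comparing free energies obtained along different routes.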
|
8 |
Inferences on Structure and Function of Proteins from Sequence Data: Development of Methods and Applications
Mudgal, Richa, January 2015
Structural and functional annotation of sequences of putative proteins encoded in newly sequenced genomes poses an important challenge. While much progress has been made towards high-throughput experimental techniques for structure determination and functional assignment of proteins, most current genome-wide annotation systems rely on computational methods to derive cues on structure and function based on relationships with related proteins of known structure and/or function. Evolutionary pressure on proteins forces the retention of sequence features that are important for structure and function. Thus, if it can be established that two proteins have descended from a common ancestor, then it can be inferred that the structural fold and biological function of the two proteins would be similar. Homology-based information transfer from one protein to another has played a central role in the understanding of the evolution of protein structures, functions and interactions. Many algorithmic improvements have been developed over the past two decades to recognize homologues of a protein from sequence-based searches alone, but there are still a large number of proteins without any functional annotation. The sensitivity of the available methods can be further enhanced by indirect comparisons with the help of intermediately-related sequences which link related families. However, sequence-based homology searches in the current protein sequence space are often restricted to family members, due to the paucity of natural intermediate sequences that can act as linkers in detecting remote homologues. Thus a major goal of this thesis is to develop computational methods to fill the sparse regions in the protein sequence space with computationally designed protein-like sequences and thereby create a continuum of protein sequences, which could aid in detecting remote homologues. Such designed sequences are further assessed for their effectiveness in the detection of distant evolutionary relationships and the functional annotation of proteins with unknown structure and function. Another important aspect of structural bioinformatics is to gain a good understanding of the protein sequence-structure-function paradigm. Functional annotations by comparisons of protein sequences can be further strengthened with the addition of structural information; however, instances of functional divergence and convergence may lead to functional mis-annotations. Therefore, a systematic analysis is performed on the fold-function associations using binding site information and their inter-relationships using binding site similarity networks.
Chapter 1 provides a background on proteins, their evolution, classification, and structural and functional features. This chapter also describes various methods for the detection of remote similarities and the role of protein sequence design methods in the detection of distant relatives for protein annotation. Pitfalls in the prediction of protein function from sequence and structure are also discussed, followed by an outline of the thesis.
Chapter 2 addresses the paucity of available protein sequences that can act as linkers between distantly related proteins/families and help in the detection of distant evolutionary relationships. Previous efforts in protein sequence design for remote homology detection and in the design of sequences corresponding to specific protein families are discussed. This chapter describes a novel methodology to computationally design intermediately-related protein sequences between two related families and thus fill in the gaps in the sequence space between the related families. Protein families as defined in the SCOP database are represented as position-specific scoring matrices (PSSMs), and these profiles of related protein families within a fold are aligned using AlignHUSH, a profile-profile alignment method. Guided by this alignment, the frequency distributions of the amino acids in the two families are combined, and for each aligned position a residue is selected based on its combined probability of occurring at the aligned positions of the two families. Each computationally designed sequence is then subjected to RPS-BLAST searches against an all-profile pool representing all protein families. Artificial sequences that detect both parent profiles with no hits corresponding to other folds qualify as 'designed intermediate sequences'. Various scoring schemes and divergence levels for the design of protein-like sequences are investigated such that these designed sequences intersperse between two related families, thereby creating a continuum in sequence space. The method is then applied on a large scale to all folds with two or more families, resulting in the design of 3,611,010 intermediately-related sequences for 27,882 profile-profile alignments corresponding to 374 folds. Such designed sequences are generic in nature and can be added to any sequence database of natural protein sequences. Such enriched databases can then be queried using any sequence-based remote homology detection method to detect distant relatives.
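The residue-selection step of Chapter 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the two family profiles are taken as already aligned, column by column, each column is a plain amino-acid frequency table, and the two families are mixed with a fixed weight. The names `combine` and `design_sequence`, the `weight_a` parameter, and the toy profiles are hypothetical; the thesis investigates several scoring schemes and divergence levels that are not reproduced here, and the RPS-BLAST filtering step is omitted.

```python
import random
from typing import Dict, List

Column = Dict[str, float]  # amino acid -> frequency at one aligned position

def combine(col_a: Column, col_b: Column, weight_a: float = 0.5) -> Column:
    """Mix the residue frequencies of two aligned columns and renormalise.
    Assumes at least one residue has a non-zero frequency."""
    residues = set(col_a) | set(col_b)
    mixed = {
        r: weight_a * col_a.get(r, 0.0) + (1.0 - weight_a) * col_b.get(r, 0.0)
        for r in residues
    }
    total = sum(mixed.values())
    return {r: f / total for r, f in mixed.items()}

def design_sequence(profile_a: List[Column], profile_b: List[Column],
                    rng: random.Random, weight_a: float = 0.5) -> str:
    """Sample one residue per aligned position from the combined distribution."""
    sequence = []
    for col_a, col_b in zip(profile_a, profile_b):
        mixed = combine(col_a, col_b, weight_a)
        residues = sorted(mixed)  # fixed order keeps sampling reproducible
        weights = [mixed[r] for r in residues]
        sequence.append(rng.choices(residues, weights=weights, k=1)[0])
    return "".join(sequence)

# Toy aligned profiles for two hypothetical families (three positions each).
family_a = [{"G": 1.0}, {"A": 0.7, "S": 0.3}, {"K": 0.6, "R": 0.4}]
family_b = [{"G": 1.0}, {"S": 0.8, "T": 0.2}, {"R": 0.9, "K": 0.1}]
designed = design_sequence(family_a, family_b, random.Random(42))
```

Shifting `weight_a` away from 0.5 biases the sampled sequence towards one parent family, which is one simple way to vary how far a designed sequence sits from each family.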
The next chapter (Chapter 3) explores the ability of these designed intermediate sequences to act as linkers between two related families and aid in the detection of remote homologues. To assess the applicability of these designed sequences, two types of databases have been generated, namely a CONTROL database containing protein sequences from natural sequence databases and an AUGMENTED database in which designed sequences are added to the database of natural sequences. Detailed assessments of the utility of such designed sequences using traditional sequence-based searches in the AUGMENTED database showed an enhanced detection of remote homologues for almost 74% of the folds. For over 3,000 queries, it is demonstrated that designed sequences are positioned as suitable linkers, which mediate connections between distantly related proteins. Using examples from known distant evolutionary relationships, we demonstrate that homology searches in augmented databases show an increase of up to 22% in the number of correct evolutionary relationships "discovered". Such connections are reported with high sensitivities and very low false positive rates. Interestingly, they fill in void and sparse regions in sequence space and relate distant proteins not only through multiple routes but also through
SCOP-NrichD database, SUPFAM+ database, SUPERFAMILY database, protein domain library queried by pDomTHREADER and HHsearch against HMM library of SCOP families. This approach detected evolutionary relationships for almost 20% of all the families with no known structure or function. A detailed report of the predictions for 614 DUFs, together with their fold and species distribution, is provided in this chapter. These predictions are then enriched with GO terms and enzyme information wherever available. A detailed discussion is provided for a few of the interesting assignments: DUF1636, DUF1572 and DUF2092, which are functionally annotated as thioredoxin-like 2Fe-2S ferredoxin, putative metalloenzyme and lipoprotein localization factors, respectively. These 614 novel structure-function relationships, of which 193 are supported by consensus between at least two of the five methods, can be accessed from http://proline.biochem.iisc.ernet.in/RHD_DUFS/.
Protein functions can be appreciated better in the light of evolutionary information from their structures. Chapter 6 describes a database of evolutionary relationships identified between Pfam families. The grouping of Pfam families is important for a better understanding of evolutionary relationships and for obtaining clues to the functions of proteins in families of as yet unknown function. Many structural genomics initiatives have made considerable efforts in solving structures and bridging the growing gap between protein sequences and their structures. The results of such experiments suggest that a structure newly solved by X-ray crystallography or NMR often has structural similarity to a protein of already known structure. These relationships often remain undetected due to the unavailability of structural information. Therefore, the SUPFAM+ database aims to detect such distant relationships between Pfam families by mapping Pfam families to SCOP domain families. The work presented in this chapter describes the generation of the SUPFAM+ database using the sensitive AlignHUSH method to uncover hidden relationships. First, Pfam families are queried against a profile database of SCOP families to derive Pfam-SCOP associations, and then Pfam families are queried against the Pfam database to derive Pfam-Pfam relationships. Pfam families that remain without a mapping to a SCOP family are mapped indirectly to a SCOP family by identifying relationships between such Pfam families and other Pfam families that are already mapped to a SCOP family. The criteria for these mappings are kept stringent to minimize the rate of false positives. When a Pfam family maps to two or more SCOP superfamilies, a decision tree is implemented to assign the Pfam family to a single SCOP superfamily. Using these direct and indirect evolutionary relationships present in the SCOP database, associations between Pfam families are derived.
Therefore, a relationship between two Pfam families that do not have significant sequence similarity can be identified if both are related to the same SCOP superfamily. Almost 36% of the Pfam families could be mapped to SCOP families through direct or indirect association. These Pfam-SCOP associations are grouped into 1,646 different superfamilies and cataloguing changes that occur in the binding sites between two functions, which are analysed in this study to trace possible routes between different functions in evolutionarily related enzymes.
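The direct and indirect Pfam-to-SCOP mapping described for Chapter 6 can be sketched as below. This is a simplified illustration, not the SUPFAM+ pipeline: the search hits are assumed to be precomputed, ambiguous indirect cases are simply left unmapped (the thesis resolves multiple candidates with a decision tree that is not reproduced here), and all function names, family identifiers and superfamily labels are hypothetical.

```python
from typing import Dict, Set

def map_pfam_to_scop(direct: Dict[str, str],
                     pfam_links: Dict[str, Set[str]]) -> Dict[str, str]:
    """direct: Pfam family -> SCOP superfamily found by a direct profile search.
    pfam_links: Pfam family -> related Pfam families from Pfam-Pfam searches.
    Returns direct mappings plus unambiguous indirect ones."""
    mapping = dict(direct)
    for family, neighbours in pfam_links.items():
        if family in mapping:
            continue  # a direct mapping takes precedence
        candidates = {direct[n] for n in neighbours if n in direct}
        if len(candidates) == 1:  # skip ambiguous cases in this sketch
            mapping[family] = next(iter(candidates))
    return mapping

def related(fam_a: str, fam_b: str, mapping: Dict[str, str]) -> bool:
    """Two Pfam families are linked if both map to the same SCOP superfamily."""
    return fam_a in mapping and mapping.get(fam_a) == mapping.get(fam_b)

# Toy data with made-up identifiers.
direct_hits = {"PF_A": "sf_1", "PF_B": "sf_1", "PF_C": "sf_2"}
pfam_pfam = {"PF_D": {"PF_A"}, "PF_E": {"PF_B", "PF_C"}, "PF_F": set()}
mapping = map_pfam_to_scop(direct_hits, pfam_pfam)
```

In this toy example `PF_D` inherits `sf_1` through its neighbour `PF_A`, while `PF_E` stays unmapped because its neighbours point to two different superfamilies.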
The main conclusions of the entire thesis are summarized in Chapter 8; they contribute to the area of remote homology detection from sequence information alone and to the understanding of the 'sequence-structure-function' paradigm from a binding site perspective. The chapter illustrates the importance of the work presented here in the post-genomic era. It highlights the development of the algorithm for the design of 'intermediately-related sequences' that could serve as effective linkers in remote homology detection, its subsequent large-scale assessment, and the fact that the designed sequences can be added to any protein sequence database and explored by any sequence-based search method. Databases in the NrichD resource are made available in the public domain along with a portal to design artificial sequences for or between protein families. This thesis also provides useful and meaningful predictions for protein families with as yet unknown structure and function using the NrichD database as well as four other state-of-the-art sequence-based remote homology detection methods. A different aspect addressed in this thesis provides a fundamental understanding of the relationships between protein structure and function. Evolutionary relationships between functional families are identified using the inherent structural information for these families, and fold-function relationships are studied from the perspective of similarities in their binding sites. Such studies help in the areas of functional annotation, polypharmacology and protein engineering.
Chapter 2 addresses the problem of paucity of available protein sequences that can act as linkers between distantly related proteins/families and help in detection of distant evolutionary relationships. Previous efforts in protein sequence design for remote homology detection and design of sequences corresponding to specific protein families are discussed. This chapter describes a novel methodology to computationally design intermediately-related protein sequences between two related families and thus fill-in the gaps in the sequence space between the related families. Protein families as defined in SCOP database are represented as position specific scoring matrices (PSSMs) and these profiles of related protein families within a fold are aligned using AlignHUSH -a profile-profile alignment method. Guided by this alignment, the frequency distribution of the amino acids in the two families are combined and for each aligned position a residue is selected based on the combined probability to occur in the alignment positions of two families. Each computationally designed sequence is then subjected to RPS-BLAST searches against an all profile pool representing all protein families. Artificial sequences that detect both the parent profiles with no hits corresponding to other folds qualify as ‘designed intermediate sequences’. Various scoring schemes and divergence levels for the design of protein-like sequences are investigated such that these designed sequences intersperse between two related families, thereby creating a continuum in sequence space. The method is then applied on a large scale for all folds with two or more families and resulted in the design of 3,611,010 intermediately-related sequences for 27,882 profile-profile alignments corresponding to 374 folds. Such designed sequences are generic in nature and can be augmented in any sequence database of natural protein sequences. 
Such enriched databases can then be queried using any sequence-based remote homology detection method to detect distant relatives.
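The design step described above, combining the aligned frequency columns of two family profiles and picking a residue per position, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the equal default weighting of the two families and the dictionary representation of a profile column are all assumptions made for illustration, and the AlignHUSH alignment and RPS-BLAST filtering steps are not modelled.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def combine_columns(col_a, col_b, weight_a=0.5):
    """Mix two per-position amino acid frequency dicts and renormalise.

    weight_a is a hypothetical knob for how strongly family A is favoured.
    """
    mixed = {
        aa: weight_a * col_a.get(aa, 0.0) + (1.0 - weight_a) * col_b.get(aa, 0.0)
        for aa in AMINO_ACIDS
    }
    total = sum(mixed.values())
    if total == 0:
        raise ValueError("both columns are empty")
    return {aa: p / total for aa, p in mixed.items()}


def design_sequence(profile_a, profile_b, alignment, rng, weight_a=0.5):
    """Sample one artificial sequence along a profile-profile alignment.

    profile_a, profile_b: lists of {amino_acid: frequency} dicts, one per column.
    alignment: list of (i, j) pairs giving aligned column indices (assumed to
    come from an external profile-profile aligner).
    """
    residues = []
    for i, j in alignment:
        column = combine_columns(profile_a[i], profile_b[j], weight_a)
        letters = list(column)
        weights = [column[aa] for aa in letters]
        # Draw one residue in proportion to its combined frequency.
        residues.append(rng.choices(letters, weights=weights, k=1)[0])
    return "".join(residues)
```

For example, `design_sequence(profile_a, profile_b, alignment, random.Random(0))` returns one candidate; in the workflow described above, each candidate would still need to be checked against the profiles of both parent families before being accepted.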
The next chapter (Chapter 3) explores the ability of these designed intermediate sequences to act as linkers between two related families and to aid in the detection of remote homologues. To assess the applicability of the designed sequences, two types of databases were generated: a CONTROL database containing protein sequences from natural sequence databases, and an AUGMENTED database in which designed sequences are added to the database of natural sequences. Detailed assessments of the utility of the designed sequences, using traditional sequence-based searches in the AUGMENTED database, showed enhanced detection of remote homologues for almost 74% of the folds. For over 3,000 queries, it is demonstrated that designed sequences are positioned as suitable linkers that mediate connections between distantly related proteins. Using examples of known distant evolutionary relationships, we demonstrate that homology searches in augmented databases show an increase of up to 22% in the number of correct evolutionary relationships "discovered". Such connections are reported with high sensitivities and very low false positive rates. Interestingly, they fill in void and sparse regions in sequence space and relate distant proteins not only through multiple routes but also through […]
[…] SCOP-NrichD database, SUPFAM+ database, SUPERFAMILY database, protein domain library queried by pDomTHREADER and HHsearch against HMM library of SCOP families. This approach detected evolutionary relationships for almost 20% of all the families with no known structure or function. A detailed report of the predictions for 614 DUFs, with their fold and species distributions, is provided in this chapter. These predictions are then enriched with GO terms and enzyme information wherever available. A detailed discussion is provided for a few of the interesting assignments: DUF1636, DUF1572 and DUF2092, which are functionally annotated as thioredoxin-like 2Fe-2S ferredoxin, putative metalloenzyme and lipoprotein localization factors, respectively. These 614 novel structure-function relationships, of which 193 are supported by consensus between at least two of the five methods, can be accessed from http://proline.biochem.iisc.ernet.in/RHD_DUFS/.
Protein functions can be better appreciated in the light of evolutionary information from their structures. Chapter 6 describes a database of evolutionary relationships identified between Pfam families. Grouping Pfam families is important for a better understanding of evolutionary relationships and for obtaining clues to the functions of proteins in families of as yet unknown function. Many structural genomics initiatives have made considerable efforts to solve structures and bridge the growing gap between protein sequences and their structures. The results of such experiments suggest that a structure newly solved by X-ray crystallography or NMR often has structural similarity to a protein of already known structure. These relationships often remain undetected when structural information is unavailable. The SUPFAM+ database therefore aims to detect such distant relationships between Pfam families by mapping Pfam families to SCOP domain families. The work presented in this chapter describes the generation of the SUPFAM+ database using the sensitive AlignHUSH method to uncover hidden relationships. First, Pfam families are queried against a profile database of SCOP families to derive Pfam-SCOP associations; then Pfam families are queried against the Pfam database to derive Pfam-Pfam relationships. Pfam families that remain without a mapping to a SCOP family are mapped indirectly, by identifying relationships between such Pfam families and other Pfam families that are already mapped to a SCOP family. The criteria for these mappings are kept stringent to minimize the rate of false positives. When a Pfam family maps to two or more SCOP superfamilies, a decision tree is used to assign it to a single SCOP superfamily. Using these direct and indirect evolutionary relationships present in the SCOP database, associations between Pfam families are derived. 
Therefore, a relationship between two Pfam families that do not have significant sequence similarity can be identified if both are related to the same SCOP superfamily. Almost 36% of the Pfam families could be mapped to SCOP families through direct or indirect association. These Pfam-SCOP associations are grouped into 1,646 different superfamilies […] and cataloguing changes that occur in the binding sites between two functions, which are analysed in this study to trace possible routes between different functions in evolutionarily related enzymes.
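The direct and indirect Pfam-to-SCOP mapping described for SUPFAM+ can be sketched as a small set-propagation step in Python. This is an illustrative sketch only: the function names and data layout are assumptions, the inputs are taken to be associations that have already passed the stringent profile-matching criteria, and the decision tree that resolves a family mapping to several superfamilies is not modelled.

```python
def map_pfam_to_scop(direct, pfam_links):
    """Extend direct Pfam->SCOP associations with one step of indirect mapping.

    direct: dict mapping a Pfam family to the set of SCOP superfamilies it
        matched directly.
    pfam_links: iterable of (pfam_a, pfam_b) pairs found to be related
        (assumed already filtered by stringent criteria).
    Returns a new dict that also covers indirectly mapped families.
    """
    mapping = {family: set(sfs) for family, sfs in direct.items()}
    indirect = {}
    for fam_a, fam_b in pfam_links:
        for source, target in ((fam_a, fam_b), (fam_b, fam_a)):
            # An unmapped family inherits the superfamilies of a mapped relative.
            if target not in direct and source in direct:
                indirect.setdefault(target, set()).update(direct[source])
    mapping.update(indirect)
    return mapping


def related_via_scop(mapping, fam_a, fam_b):
    """Two Pfam families are linked if they share at least one SCOP superfamily."""
    return bool(mapping.get(fam_a, set()) & mapping.get(fam_b, set()))
```

With hypothetical identifiers, `map_pfam_to_scop({"PF_A": {"sf1"}}, [("PF_A", "PF_D")])` assigns `PF_D` to `sf1`, after which `related_via_scop` links `PF_D` to any other family mapped to `sf1`, even without detectable sequence similarity between the two.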
|
9 |
Characterization of the Protein Lysine Methyltransferase SMYD2
Lanouette, Sylvain, January 2015
Our understanding of protein lysine methyltransferases and their substrates remains limited despite their importance as regulators of the proteome. The SMYD (SET and MYND domain) methyltransferase family plays pivotal roles in various cellular processes, including transcriptional regulation and embryonic development. Among them, SMYD2 is associated with oesophageal squamous cell carcinoma, bladder cancer and leukemia as well as with embryonic development. Initially identified as a histone methyltransferase, SMYD2 was later reported to methylate p53, the retinoblastoma protein pRb and the estrogen receptor ERalpha and to regulate their activity. Our proteomic and biochemical analyses demonstrated that SMYD2 also methylates the molecular chaperone HSP90 on K209 and K615. We also showed that HSP90 methylation is regulated by HSP90 co-chaperones, pH, and the demethylase LSD1. Further methyltransferase assays demonstrated that SMYD2 methylates lysine K* in proteins which include the sequence [LFM]₋₁-K*-[AFYMSHRK]₊₁-[LYK]₊₂. This motif allowed us to show that SMYD2 methylates the transcriptional co-repressor SIN3B, the RNA helicase DHX15 and the myogenic transcription factors SIX1 and SIX2. Finally, muscle cell models suggest that SMYD2 methyltransferase activity plays a role in preventing premature myogenic differentiation of proliferating myoblasts by repressing muscle-specific genes. Our work thus shows that SMYD2 methyltransferase activity targets a broad array of substrates in vitro and in situ and is regulated by intricate mechanisms.
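As an illustration of how the motif quoted above could be used to scan a protein sequence for candidate lysines, here is a minimal Python sketch. The regular expression is a direct transcription of the motif from the abstract; the function name and the choice of 1-based positions are assumptions, and a motif match is only a candidate, not evidence of methylation.

```python
import re

# Motif from the abstract: [LFM] at -1, the target lysine K*, [AFYMSHRK] at +1,
# [LYK] at +2. A lookahead is used so that overlapping matches are all reported.
SMYD2_MOTIF = re.compile(r"(?=[LFM]K[AFYMSHRK][LYK])")


def find_candidate_lysines(sequence):
    """Return 1-based positions of lysines whose context matches the motif."""
    # m.start() is the 0-based index of the -1 residue, so the lysine sits at
    # 0-based index m.start() + 1, i.e. 1-based position m.start() + 2.
    return [m.start() + 2 for m in SMYD2_MOTIF.finditer(sequence.upper())]
```

Called on the toy string `"MKKLKYL"`, the function reports positions 2 and 5; on a real protein sequence, the returned positions would be a shortlist for experimental testing.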
|
10 |
Computational protein design : un outil pour l'ingénierie des protéines et la biologie synthétique / Computational protein design: a tool for protein engineering and synthetic biology
Mignon, David, 20 December 2017
Computational Protein Design, or CPD, is the search for amino acid sequences compatible with a targeted protein structure. The goal is to design a new function and/or add a new behavior. CPD has been under development in our laboratory for several years, with the Proteus software, which has several successes to its credit. Our approach uses a physics-based energy model and relies on the energy difference between the folded and unfolded states of the protein. During this thesis, we extended Proteus in several ways, including the addition of a Replica Exchange Monte Carlo (REMC) exploration method. We extensively compared three stochastic methods for exploring sequence space: REMC, plain Monte Carlo, and a heuristic designed for CPD, Multistart Steepest Descent (MSD). These comparisons involved nine proteins from three structural families: SH2, SH3 and PDZ. Using these exploration techniques, we were able to identify the Global Minimum Energy Conformation, or GMEC, for nearly all test cases in which up to 10 positions of the polypeptide chain were free to mutate (the others retaining their native types). For the tests in which 20 positions were free to mutate, the GMEC was identified in two thirds of the cases. Overall, REMC and MSD gave very good sequences in terms of energy, often identical or very close to the GMEC. MSD performed best in the tests with 30 mutating positions. REMC with eight replicas and optimized parameters most often gave the best result when all positions could mutate. Moreover, compared to an exact enumeration of the low-energy sequences, REMC provided a sample of sequences with high diversity. In the second part of this work, we tested our CPD model for PDZ domain design. For the folded state, we used two variants of a Generalized Born (GB) solvent model. The first used a mean, effective protein/solvent dielectric boundary; the second, more rigorous, used an exact boundary that fluctuated over the MC trajectory. To characterize the unfolded state, we used a set of amino acid chemical potentials, or reference energies. These reference energies were determined by maximizing a likelihood function so as to reproduce the amino acid frequencies in natural PDZ domains. The sequences designed by Proteus were compared to natural sequences. Our sequences are globally similar to the Pfam sequences, in the sense of BLOSUM40 scores, with especially high scores for residues in the core of the protein. The more rigorous GB variant always gave sequences similar to moderately distant natural homologues, and these sequences were perfectly recognized by the Superfamily fold recognition tool. Our sequences were also compared to those produced by the Rosetta software. The quality, according to the same criteria as before, was very similar, but the Rosetta sequences exhibited fewer mutations than the Proteus sequences.
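To make the REMC exploration mentioned in this abstract more concrete, the following Python sketch runs replica exchange Monte Carlo over sequences with a toy energy function. It is not the Proteus implementation: the mismatch-count energy, the single-position mutation move, the temperature ladder and all function names are assumptions chosen for illustration. Only the Metropolis test and the standard replica-swap acceptance rule, min(1, exp[(1/kT_i - 1/kT_j)(E_i - E_j)]), are textbook.

```python
import math
import random


def mismatch_energy(seq, target="ACDA"):
    """Toy energy (assumption): number of positions differing from a target."""
    return sum(1 for a, b in zip(seq, target) if a != b)


def metropolis_accept(delta_e, kT, rng):
    """Metropolis test for a move within one replica."""
    return delta_e <= 0 or rng.random() < math.exp(-delta_e / kT)


def swap_accept(e_i, e_j, kT_i, kT_j, rng):
    """Standard replica-exchange test for swapping replicas i and j."""
    log_ratio = (1.0 / kT_i - 1.0 / kT_j) * (e_i - e_j)
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)


def remc(energy, alphabet, length, temperatures, n_sweeps, rng):
    """Return the lowest-energy sequence seen across all replicas."""
    replicas = [[rng.choice(alphabet) for _ in range(length)] for _ in temperatures]
    energies = [energy(seq) for seq in replicas]
    best_e = min(energies)
    best_seq = list(replicas[energies.index(best_e)])
    for _ in range(n_sweeps):
        # One single-position mutation attempt per replica.
        for r, kT in enumerate(temperatures):
            pos = rng.randrange(length)
            old = replicas[r][pos]
            replicas[r][pos] = rng.choice(alphabet)
            new_e = energy(replicas[r])
            if metropolis_accept(new_e - energies[r], kT, rng):
                energies[r] = new_e
                if new_e < best_e:
                    best_e, best_seq = new_e, list(replicas[r])
            else:
                replicas[r][pos] = old  # reject: restore the previous residue
        # One swap attempt between a random pair of neighbouring temperatures.
        r = rng.randrange(len(temperatures) - 1)
        if swap_accept(energies[r], energies[r + 1],
                       temperatures[r], temperatures[r + 1], rng):
            replicas[r], replicas[r + 1] = replicas[r + 1], replicas[r]
            energies[r], energies[r + 1] = energies[r + 1], energies[r]
    return "".join(best_seq), best_e
```

A call such as `remc(mismatch_energy, "ACD", 4, [0.2, 0.5, 1.0, 2.0], 2000, random.Random(0))` explores an 81-sequence toy space. The hot replicas move almost freely while the cold ones refine low-energy sequences, which is the intuition behind REMC yielding both good energies and diverse samples.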
|