1 |
Inferences on Structure and Function of Proteins from Sequence Data: Development of Methods and Applications
Mudgal, Richa January 2015
Structural and functional annotation of the putative protein sequences encoded in newly sequenced genomes poses an important challenge. While much progress has been made towards high-throughput experimental techniques for structure determination and functional assignment of proteins, most current genome-wide annotation systems rely on computational methods that derive cues on structure and function from relationships with proteins of known structure and/or function. Evolutionary pressure on proteins forces the retention of sequence features that are important for structure and function. Thus, if it can be established that two proteins have descended from a common ancestor, it can be inferred that the structural fold and biological function of the two proteins would be similar. Homology-based information transfer from one protein to another has played a central role in the understanding of the evolution of protein structures, functions and interactions. Many algorithmic improvements have been developed over the past two decades to recognize homologues of a protein from sequence-based searches alone, but a large number of proteins still lack any functional annotation. The sensitivity of the available methods can be further enhanced by indirect comparisons using intermediately related sequences that link related families. However, sequence-based homology searches in the current protein sequence space are often restricted to family members, owing to the paucity of natural intermediate sequences that can act as linkers in detecting remote homologues. Thus, a major goal of this thesis is to develop computational methods that fill the sparse regions of protein sequence space with computationally designed protein-like sequences and thereby create a continuum of protein sequences that could aid in detecting remote homologues. Such designed sequences are further assessed for their effectiveness in detecting distant evolutionary relationships and in the functional annotation of proteins of unknown structure and function. Another important aspect of structural bioinformatics is gaining a good understanding of the protein sequence-structure-function paradigm. Functional annotations obtained by comparing protein sequences can be strengthened by adding structural information; however, instances of functional divergence and convergence may lead to functional mis-annotations. Therefore, a systematic analysis is performed on fold-function associations using binding site information, and on their inter-relationships using binding site similarity networks.
Chapter 1 provides a background on proteins, their evolution, classification, and structural and functional features. This chapter also describes various methods for the detection of remote similarities and the role of protein sequence design methods in detecting distant relatives for protein annotation. Pitfalls in the prediction of protein function from sequence and structure are also discussed, followed by an outline of the thesis.
Chapter 2 addresses the paucity of available protein sequences that can act as linkers between distantly related proteins or families and help in detecting distant evolutionary relationships. Previous efforts in protein sequence design for remote homology detection, and in the design of sequences corresponding to specific protein families, are discussed. This chapter describes a novel methodology to computationally design intermediately related protein sequences between two related families and thus fill in the gaps in sequence space between them. Protein families as defined in the SCOP database are represented as position-specific scoring matrices (PSSMs), and the profiles of related protein families within a fold are aligned using AlignHUSH, a profile-profile alignment method. Guided by this alignment, the amino acid frequency distributions of the two families are combined, and for each aligned position a residue is selected based on its combined probability of occurring at that position in the two families. Each computationally designed sequence is then subjected to RPS-BLAST searches against a pool of profiles representing all protein families. Artificial sequences that detect both parent profiles, with no hits corresponding to other folds, qualify as ‘designed intermediate sequences’. Various scoring schemes and divergence levels for the design of protein-like sequences are investigated such that the designed sequences intersperse between two related families, thereby creating a continuum in sequence space. The method was then applied on a large scale to all folds with two or more families, resulting in the design of 3,611,010 intermediately related sequences for 27,882 profile-profile alignments corresponding to 374 folds. Such designed sequences are generic in nature and can be added to any database of natural protein sequences. Such enriched databases can then be queried using any sequence-based remote homology detection method to detect distant relatives.
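The per-position design step described for Chapter 2 can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes each profile column is available as a dictionary of amino acid frequencies (rather than PSSM scores), that the alignment is given as a list of paired column indices, and that the two families are mixed with a simple weight. The names `combine_columns`, `design_sequence` and `weight_a` are hypothetical, and the RPS-BLAST filtering step is not shown.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def combine_columns(col_a, col_b, weight_a=0.5):
    """Mix two per-residue frequency dicts for one aligned position.

    Assumption: a linear mix of the two families' frequencies, renormalised
    to sum to 1. The thesis investigates several scoring schemes instead.
    """
    mixed = {
        aa: weight_a * col_a.get(aa, 0.0) + (1.0 - weight_a) * col_b.get(aa, 0.0)
        for aa in AMINO_ACIDS
    }
    total = sum(mixed.values())
    if total == 0:
        raise ValueError("both columns are empty")
    return {aa: p / total for aa, p in mixed.items()}


def design_sequence(profile_a, profile_b, aligned_pairs, rng, weight_a=0.5):
    """Sample one residue per aligned position from the combined distribution."""
    residues = []
    for i, j in aligned_pairs:
        probs = combine_columns(profile_a[i], profile_b[j], weight_a)
        weights = [probs[aa] for aa in AMINO_ACIDS]
        residues.append(rng.choices(AMINO_ACIDS, weights=weights, k=1)[0])
    return "".join(residues)


# Toy two-column profiles (hypothetical values, not real family data).
profile_a = [{"A": 1.0}, {"L": 0.5, "I": 0.5}]
profile_b = [{"G": 1.0}, {"L": 1.0}]
aligned_pairs = [(0, 0), (1, 1)]

designed = design_sequence(profile_a, profile_b, aligned_pairs, random.Random(0))
print(designed)
```

Changing `weight_a` moves the sampled sequences closer to one parent family or the other, which is one simple way to think about the divergence levels mentioned in the abstract.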
The next chapter (Chapter 3) explores the ability of these designed intermediate sequences to act as linkers between two related families and to aid in the detection of remote homologues. To assess the applicability of the designed sequences, two types of databases were generated: a CONTROL database containing protein sequences from natural sequence databases, and an AUGMENTED database in which designed sequences are added to the database of natural sequences. Detailed assessments of the utility of the designed sequences, using traditional sequence-based searches in the AUGMENTED database, showed enhanced detection of remote homologues for almost 74% of the folds. For over 3,000 queries, it is demonstrated that designed sequences are positioned as suitable linkers that mediate connections between distantly related proteins. Using examples of known distant evolutionary relationships, we demonstrate that homology searches in augmented databases show an increase of up to 22% in the number of correct evolutionary relationships "discovered". Such connections are reported with high sensitivities and very low false positive rates. Interestingly, they fill in void and sparse regions in sequence space and relate distant proteins not only through multiple routes but also through
SCOP-NrichD database, SUPFAM+ database, SUPERFAMILY database, protein domain library queried by pDomTHREADER and HHsearch against HMM library of SCOP families. This approach detected evolutionary relationships for almost 20% of all the families with no known structure or function. A detailed report of the predictions for 614 DUFs (domains of unknown function), together with their fold and species distributions, is provided in this chapter. These predictions are then enriched with GO terms and enzyme information wherever available. A detailed discussion is provided for a few of the interesting assignments: DUF1636, DUF1572 and DUF2092, which are functionally annotated as thioredoxin-like 2Fe-2S ferredoxin, putative metalloenzyme and lipoprotein localization factors, respectively. These 614 novel structure-function relationships, of which 193 are supported by consensus between at least two of the five methods, can be accessed from http://proline.biochem.iisc.ernet.in/RHD_DUFS/.
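The "consensus between at least two of the five methods" criterion mentioned above can be sketched as a simple vote count. This is only an illustration under stated assumptions: each method is assumed to report at most one fold label per family, the function name `consensus_fold` and the labels `fold_x` and `fold_y` are hypothetical, and the thesis may resolve ties or partial agreement differently.

```python
from collections import Counter


def consensus_fold(predictions, min_support=2):
    """Return the fold reported by at least `min_support` methods, else None.

    `predictions` maps a method name to its predicted fold label, or to None
    when the method found no significant hit.
    """
    counts = Counter(fold for fold in predictions.values() if fold is not None)
    if not counts:
        return None
    fold, support = counts.most_common(1)[0]
    return fold if support >= min_support else None


# Hypothetical outputs of the five methods for a single family.
predictions = {
    "SCOP-NrichD": "fold_x",
    "SUPFAM+": "fold_x",
    "SUPERFAMILY": None,
    "pDomTHREADER": "fold_y",
    "HHsearch": None,
}

agreed = consensus_fold(predictions)
print(agreed)
```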
Protein functions can be better appreciated in the light of evolutionary information from their structures. Chapter 6 describes a database of evolutionary relationships identified between Pfam families. Grouping Pfam families is important for a better understanding of evolutionary relationships and for obtaining clues to the functions of proteins in families of as yet unknown function. Many structural genomics initiatives have made considerable efforts to solve structures and bridge the growing gap between protein sequences and their structures. The results of such experiments suggest that a structure newly solved by X-ray crystallography or NMR often has structural similarity to a protein of already known structure. These relationships often remain undetected due to the unavailability of structural information. Therefore, the SUPFAM+ database aims to detect such distant relationships between Pfam families by mapping Pfam families to SCOP domain families. The work presented in this chapter describes the generation of the SUPFAM+ database using the sensitive AlignHUSH method to uncover hidden relationships. First, Pfam families are queried against a profile database of SCOP families to derive Pfam-SCOP associations, and then Pfam families are queried against the Pfam database to derive Pfam-Pfam relationships. Pfam families that remain without a mapping to a SCOP family are mapped indirectly to a SCOP family by identifying relationships between such Pfam families and other Pfam families that are already mapped to a SCOP family. The criteria for these mappings are kept stringent to minimize the rate of false positives. Where a Pfam family maps to two or more SCOP superfamilies, a decision tree is implemented to assign the Pfam family to a single SCOP superfamily. Using these direct and indirect evolutionary relationships present in the SCOP database, associations between Pfam families are derived. Therefore, a relationship between two Pfam families that do not have significant sequence similarity can be identified if both are related to the same SCOP superfamily. Almost 36% of the Pfam families could be mapped to SCOP families through direct or indirect association. These Pfam-SCOP associations are grouped into 1,646 different superfamilies and cataloguing changes that occur in the binding sites between two functions, which are analysed in this study to trace possible routes between different functions in evolutionarily related enzymes.
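The direct and indirect Pfam-to-SCOP mapping described for Chapter 6 can be sketched roughly as follows. This is a simplified illustration, not the SUPFAM+ pipeline: it assumes the significant hits have already been computed, it takes the first mapped neighbour instead of applying the decision tree mentioned in the abstract, and all identifiers (`PF_A`, `sf_1`, `map_pfam_to_scop`, `group_by_superfamily`) are hypothetical.

```python
from collections import defaultdict


def map_pfam_to_scop(direct, pfam_links):
    """Extend direct Pfam->SCOP hits with one-step indirect hits.

    `direct` maps a Pfam family to the SCOP superfamily it hits directly.
    `pfam_links` maps a Pfam family to related Pfam families (Pfam-Pfam hits).
    A family with no direct hit inherits the superfamily of its first
    directly mapped neighbour (an assumption made for this sketch).
    """
    mapping = dict(direct)
    for family, neighbours in pfam_links.items():
        if family in mapping:
            continue
        for other in neighbours:
            if other in direct:
                mapping[family] = direct[other]
                break
    return mapping


def group_by_superfamily(mapping):
    """Group Pfam families that share a SCOP superfamily."""
    groups = defaultdict(set)
    for family, superfamily in mapping.items():
        groups[superfamily].add(family)
    return dict(groups)


# Hypothetical toy data.
direct = {"PF_A": "sf_1", "PF_B": "sf_2"}
pfam_links = {"PF_C": ["PF_A"], "PF_D": ["PF_E"]}

mapping = map_pfam_to_scop(direct, pfam_links)
groups = group_by_superfamily(mapping)
print(groups)
```

In this toy case `PF_A` and `PF_C` end up in the same group even though only `PF_A` hit the superfamily directly, which mirrors how two Pfam families without significant sequence similarity can be related through a shared SCOP superfamily.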
Chapter 8 summarizes the main conclusions of the thesis, which contributes to remote homology detection from sequence information alone and to understanding the ‘sequence-structure-function’ paradigm from a binding site perspective. The chapter illustrates the importance of the work presented here in the post-genomic era. It highlights the development of the algorithm for designing ‘intermediately-related sequences’ that could serve as effective linkers in remote homology detection, its subsequent large-scale assessment, and the fact that the designed sequences can be added to any protein sequence database and explored by any sequence-based search method. Databases in the NrichD resource are made available in the public domain, along with a portal for designing artificial sequences for or between protein families. This thesis also provides useful and meaningful predictions for protein families of as yet unknown structure and function, using the NrichD database as well as four other state-of-the-art sequence-based remote homology detection methods. A different aspect addressed in this thesis provides a fundamental understanding of the relationships between protein structure and function. Evolutionary relationships between functional families are identified using the inherent structural information for these families, and fold-function relationships are studied from the perspective of similarities in their binding sites. Such studies help in the areas of functional annotation, polypharmacology and protein engineering.
|
2 |
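Tirer profit de l’espace de séquence : une approche multidisciplinaire pour élucider l’évolution d’une famille d’enzymes primitives [Leveraging sequence space: a multidisciplinary approach to elucidate the evolution of a family of primitive enzymes]
Lemay-St-Denis, Claudèle 01 1900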
The ability of enzymes to evolve plays a fundamental role in the adaptation of organisms to their environment, allowing them to adjust to changes in temperature, available nutrients, or the introduction of cytotoxic compounds. In recent decades, this ability has led to the rapid emergence of antibiotic resistance mechanisms in human pathogenic bacteria, particularly in the case of the synthetic antibiotic trimethoprim. Ten years after the introduction of this antibiotic, the type B dihydrofolate reductase (DfrB) was identified as conferring resistance to bacteria expressing it by providing an alternative enzyme to catalyze the reaction inhibited by the antibiotic.
Structural, kinetic, and mechanistic studies of DfrB have revealed its atypical nature and suggest that this enzyme is a model of a primitive enzyme. In particular, its unique active site is formed by the interface of four identical protomers. Since DfrB enzymes are not evolutionarily related to any known and characterized proteins, it is not known how they evolved to ultimately contribute to trimethoprim resistance, and in particular how their catalytic ability arose within the small domain encoded by their genes. Thus, this thesis aims to deepen our understanding of enzyme evolution by specifically examining the evolution of DfrB and the properties that guided this process.
Since DfrB genes have rarely been reported, I first present our efforts to genomically identify and characterize DfrB in public databases. These efforts led to the first discovery of DfrB genes outside the clinical context. We then biophysically and enzymatically characterized protein homologues of the DfrB we identified in putative protein databases. We demonstrated the ability of homologues identified in environmental contexts unrelated to human activities to catalyze dihydrofolate reduction in the same manner as DfrB. Finally, a broad search for sequence homologues, followed by experimental and computational characterization, allowed us to identify distant DfrB homologues, some capable of conferring resistance to trimethoprim and others lacking this ability. These results have allowed us to propose a model that explains the emergence of catalytic activity within the DfrB domain.
In summary, this thesis presents a multidisciplinary approach to explore and characterize the sequence space of a protein family. This approach, which includes genomic, enzymatic, biophysical and bioinformatic analyses, has enabled us to identify the structural and sequence features necessary for the formation of a functional DfrB enzyme. We have also proposed a model to explain the evolution of this primitive enzyme. Overall, our results suggest that the catalytic capacity of DfrB evolved independently of the introduction of the antibiotic trimethoprim, and thus that this resistance mechanism existed in the environment prior to its genomic recruitment in a clinical context.
This work contributes to our fundamental understanding of the mechanisms underlying the emergence of catalytic activity within a non-catalytic protein domain, and informs studies of the mechanisms developed by bacteria to proliferate in the presence of antibiotics.
|
3 |
A theory of multiplier functions and sequences and its applications to Banach spaces / I.M. Schoeman
Schoeman, Ilse Maria January 2005
Thesis (Ph.D. (Mathematics))--North-West University, Potchefstroom Campus, 2006.
|
4 |
A theory of multiplier functions and sequences and its applications to Banach spaces / Ilse Maria Schoeman
Schoeman, Ilse Maria January 2005
Abstract does not display correctly / Thesis (Ph.D. (Mathematics))--North-West University, Potchefstroom Campus, 2006
|
5 |
A theory of multiplier functions and sequences and its applications to Banach spaces / Ilse Maria Schoeman
Schoeman, Ilse Maria January 2005
Abstract does not display correctly / Thesis (Ph.D. (Mathematics))--North-West University, Potchefstroom Campus, 2006
|