11 |
Origin of tRNA Genes in Trypanosoma and Leishmania and Comparison of Eukaryote Phylogenies Obtained from Mitochondrial rRNA and Protein Sequences
Yang, Xiaoguang January 2005
<p> Two studies are presented in this thesis. The first concerns the origin of tRNA genes in Trypanosoma and Leishmania. These organisms have a special mitochondrial DNA, termed kinetoplast DNA (kDNA), which is unique in its structure and function. kDNA is a massive network composed of thousands of interlinked DNA circles. Unlike most other mitochondrial genomes, their kDNA encodes no tRNA genes, so all tRNAs used in the mitochondria must be encoded by nuclear genes and imported from the cytoplasm. Our question of interest is therefore where these nuclear tRNA genes come from. We carried out a phylogenetic analysis of these genes and the corresponding genes in bacteria, mitochondria and eukaryotic nuclei. This analysis provides no evidence of gene transfer from the mitochondrion to the nucleus. The results are consistent with the simplest hypothesis, i.e. that all tRNA genes of Trypanosoma and Leishmania have the same origin as the nuclear genes of other eukaryotes.</p> <p> The second study compares eukaryote phylogenies obtained from mitochondrial rRNA and protein sequences. We carried out phylogenetic analyses of species with complete mitochondrial genomes, using both concatenated mitochondrial rRNA sequences and concatenated protein sequences. We obtained phylogenies for three groups: fungi/metazoa, plants/algae and stramenopiles/alveolates. The analysis is useful for further study of the positions of genetic code changes and the mechanisms involved.</p> / Thesis / Master of Science (MSc)
|
12 |
Arquiteturas em hardware para o alinhamento local de sequências biológicas / Hardware architectures for local biological sequence alignment
Mallmann, Rafael Mendes January 2010
Biological databases used for sequence comparison and local sequence alignment have been growing exponentially, which has popularized programs that search them. Implementations of the Smith-Waterman and Levenshtein distance sequence alignment algorithms have proven to be computationally intensive and hence amenable to hardware acceleration. This MSc thesis describes dedicated hardware architectures, prototyped for FPGA and ASIC, that accelerate the Smith-Waterman and Levenshtein distance algorithms while producing the same results as software implementations. We describe a new, efficient processing element for Smith-Waterman with affine gap penalties, and a new architecture that partitions the input sequences of the Levenshtein distance computation and maps them onto a fixed-size systolic array. Our FPGA Smith-Waterman implementation delivers a 275- to 494-fold speed-up over a standard desktop computer with a general-purpose processor and is also, to the best of our knowledge, about 52 to 113% faster than the fastest recently published architectures.
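For readers unfamiliar with the two algorithms named in this abstract, the following is a minimal software sketch of what such hardware accelerates. It is not the thesis's architecture: it is a plain Python rendering of the standard Smith-Waterman recurrences with affine gaps (Gotoh's formulation) and of the Levenshtein distance. The function names and all scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumptions chosen for illustration.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score with affine gaps (Gotoh-style recurrences).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All scoring values are illustrative assumptions.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    # Only the previous row is needed when just the score is wanted.
    h_prev = [0] * cols        # best local score ending at cell (i-1, j)
    f_prev = [neg_inf] * cols  # same, but ending with a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # ending with a horizontal gap in the current row
        for j in range(1, cols):
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 term is what makes the alignment local.
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]
```

Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal can be computed at the same time. That dependency pattern is what makes these recurrences a natural fit for a systolic array of processing elements.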
|
15 |
Topology-based Sequence Design For Proteins Structures And Statistical Potentials Sensitive To Local Environments
Jha, Anupam Nath 11 1900
Proteins, which regulate most biological activities, perform their functions through their unique three-dimensional structures. How a one-dimensional sequence folds into this three-dimensional structure is not well understood. The available evidence indicates that protein structures are mostly conserved while sequences are more tolerant to mutations, i.e. a number of sequences can adopt the same fold. The search for optimal sequences for a chosen conformation is known as inverse protein folding, and this thesis takes this approach to the problem.
This thesis presents a protein sequence design method based on the native-state topology of the protein structure. The structural importance of each amino acid position is expressed as a topological parameter of the protein conformation. This scheme for extracting the topology of structures has been applied successfully to three-dimensional lattice structures, yielding minimum-energy sequences for a given structure. Combined with a reduced amino acid alphabet (any clustering of the twenty amino acids based on some measure of their relative similarity), the technique has then been applied to protein structures to design optimal amino acid sequences for a given structure. The designed sequences are energetically much better than the native amino acid sequence. The utility of the method is further confirmed by the similarity between naturally occurring and designed sequences. In summary, a computationally efficient method for designing optimal sequences for a given structure is presented.
The physical interaction energy between amino acids is important in the study of protein-protein interactions, structure prediction, modeling, docking, etc. The local environment distinguishes otherwise identical amino acid pairs within a protein structure, so the pairwise interaction energy of amino acid residues should depend on their respective environments. A local-environment-dependent, knowledge-based potential energy function is developed in this thesis. Two different environments are considered: the local degree (number of contacts) and the secondary structural element of the amino acids. The investigations show that environment-based interaction preferences of amino acids provide good potential energy functions, which perform exceedingly well in discriminating the native structure from structures with random interactions.
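A common generic form for a knowledge-based contact potential of this kind is the inverse-Boltzmann (quasi-chemical) expression below. It is shown only as an illustrative sketch: the abstract does not give the thesis's actual functional form, and the symbols $e$, $N^{\mathrm{obs}}$, $N^{\mathrm{exp}}$ and the environment label $\alpha$ are assumptions introduced here.

```latex
e_{ij}^{(\alpha)} \;=\; -\ln \frac{N_{ij}^{\mathrm{obs}}(\alpha)}{N_{ij}^{\mathrm{exp}}(\alpha)}
```

Here $N_{ij}^{\mathrm{obs}}(\alpha)$ would be the number of contacts observed between residue types $i$ and $j$ in environment $\alpha$ (for example a contact-number class or a secondary structure class) in a set of known structures, and $N_{ij}^{\mathrm{exp}}(\alpha)$ the number expected in a reference state without specific interactions. Pairs seen more often than expected receive a negative (favourable) energy.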
Further, membrane proteins are located in a completely different physico-chemical environment, with a different amino acid composition, than water-soluble proteins. This work provides reliable potential energy functions that account for the different environments, for investigating (modelling/predicting) the structure of helical membrane proteins. Three different environments, parallel to the lipid bilayer, perpendicular to the lipid bilayer, and the number of amino acid contacts, are explored to analyze environmental effects on the potential functions. These environment-dependent scoring functions perform exceedingly well in discriminating the native sequence from a set of random sequences.
Hydrophobicity of an amino acid is a measure of its buriedness or exposure to the aqueous environment. The lack of uniformity within the protein environment gives rise to different hydrophobicity values for the same amino acid, depending on its location inside the protein. Contact-based, environment-dependent hydrophobicity values of all amino acids, evaluated separately for globular and membrane proteins, are also reported in this thesis.
Apart from developing scoring functions, the packing of helices in membrane proteins is investigated by an approach based on the local backbone geometry and side-chain atom-atom contacts of amino acids. A parameter defined in this study captures the essential features of inter-helical packing and may prove useful in modeling helical membrane proteins.
In conclusion, this thesis describes a novel technique for designing energetically minimized amino acid sequences that can fold into a given conformation. The environment-dependent interaction preferences of amino acids in globular proteins are also captured in an efficient manner. In particular, the environment-dependent scoring function for helical membrane proteins is a first successful attempt in this direction.
|
16 |
Characterisation and classification of protein sequences by using enhanced amino acid indices and signal processing-based methods
Chrysostomou, Charalambos January 2013
Protein sequencing has produced an overwhelming number of protein sequences, especially in the last decade. Nevertheless, the functional and structural classes of the majority of proteins are still unknown, and the experimental methods currently used to determine these properties are very expensive, laborious and time consuming. Automated computational methods are therefore urgently required to predict the functional and structural classes of proteins accurately and reliably. Several bioinformatics methods have been developed to determine such properties directly from sequence information. Methods based on signal processing have recently become popular in bioinformatics; they have been investigated for the analysis of DNA and protein sequences and shown to be useful and to generally help characterise the sequences better. However, various technical issues need to be addressed to overcome the problems associated with signal processing methods for the analysis of protein sequences. Amino acid indices, which are used to transform protein sequences into signals, have various applications and can represent diverse features of protein sequences and amino acids. As the majority of indices have similar features, this project proposes a new set of computationally derived indices that better represent the original group of indices. A study is also carried out that identifies a unique and universal set of best-discriminating amino acid indices for the characterisation of allergenic proteins. This analysis extracts features directly from the protein sequences using the Discrete Fourier Transform (DFT) to build a classification model for allergenic proteins based on Support Vector Machines (SVM). The proposed predictive model yields higher and more reliable accuracy than existing methods. A new method is proposed for performing multiple sequence alignment. In this method, a DFT-based approach is used to construct a new distance matrix in combination with multiple amino acid indices, which encode the protein sequences as numerical sequences. Additionally, a new type of substitution matrix is proposed in which the physicochemical similarities between any given amino acids are calculated. These similarities are calculated from the 25 selected amino acid indices, each of which represents a unique biological protein feature. The proposed multiple sequence alignment method yields a better and more reliable alignment than existing methods. To evaluate the complex information generated by the DFT, Complex Informational Spectrum Analysis (CISA) is developed and presented. As the results show, when protein classes present similarities or differences according to the Common Frequency Peak (CFP) in specific amino acid indices, it is probable that these classes are related to the protein feature that the specific amino acid index represents. Using only the absolute spectrum in informational spectrum analysis of protein sequences is shown to be insufficient, as biologically related features can appear individually in either the real or the imaginary spectrum. This is successfully demonstrated in the analysis of influenza neuraminidase protein sequences. Upon identification of a new protein, it is important to single out the amino acids responsible for its structural and functional classification, as well as the amino acids contributing to its specific biological characterisation. In this work, a novel approach is presented to identify and quantify the relationship between individual amino acids and the protein. This is also successfully demonstrated in the analysis of influenza neuraminidase protein sequences. The characterisation and identification of Influenza A virus protein sequences is tackled through a Subgroup Discovery (SD) algorithm, which can provide ancillary knowledge to experts. The main objective of this case study was to derive interpretable knowledge for the influenza A virus problem and consequently to better describe the relationships between subtypes of this virus. Finally, using DFT-based sequence-driven features, a Support Vector Machine (SVM)-based classification model was built and tested that yields higher predictive accuracy than SD. The methods developed and presented in this study yield promising results and can be easily applied in proteomics.
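The encoding step described in this abstract (amino acid index, then DFT, then a peak shared across sequences) can be sketched as follows. This is a minimal illustration and not the thesis's implementation: `TOY_INDEX` is a made-up index with arbitrary values rather than a published amino acid index, `common_frequency_peak` is a hypothetical helper name, and zero-padding to a common length and mean removal are assumptions about preprocessing.

```python
import cmath

# Hypothetical toy index: arbitrary values for illustration only,
# NOT a published amino acid index.
TOY_INDEX = {"A": 0.1, "C": 0.5, "D": -0.3, "G": 0.0, "K": -0.6, "L": 0.9}


def encode(seq, index):
    """Map each residue to its numerical index value."""
    return [index[aa] for aa in seq]


def dft(signal):
    """Naive O(n^2) discrete Fourier transform returning complex coefficients."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]


def common_frequency_peak(seqs, index):
    """Multiply the absolute spectra of all sequences and locate the largest peak.

    Each signal is mean-centred (to suppress the zero-frequency term) and
    zero-padded to the longest sequence; both choices are assumptions.
    Returns (peak_bin, product_spectrum) over bins 0 .. n // 2.
    """
    n = max(len(s) for s in seqs)
    product = [1.0] * (n // 2 + 1)
    for s in seqs:
        signal = encode(s, index)
        mean = sum(signal) / len(signal)
        signal = [x - mean for x in signal] + [0.0] * (n - len(signal))
        spectrum = dft(signal)
        for k in range(n // 2 + 1):
            # Only the magnitude is used here; spectrum[k].real and
            # spectrum[k].imag are the components a complex analysis
            # would inspect separately.
            product[k] *= abs(spectrum[k])
    peak = max(range(1, n // 2 + 1), key=lambda k: product[k])
    return peak, product
```

With two toy sequences that both alternate between two residues, such as `"ALALALAL"` and `"KLKLKLKL"`, the shared period of two residues shows up as a common peak at the highest frequency bin (bin 4 of 8).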
|
17 |
Modelling and inference for biological systems : from auxin dynamics in plants to protein sequences. / Modélisation et inférence de systèmes biologiques : de la dynamique de l’auxine dans les plantes aux séquences des protéines
Grigolon, Silvia 14 September 2015
All biological systems are made of atoms and molecules interacting in a non-trivial manner. Such non-trivial interactions induce complex behaviours that allow organisms to fulfill all their vital functions. These features can be found in all biological systems at different levels, from molecules and genes up to cells and tissues. In the past few decades, physicists have paid much attention to these intriguing aspects by framing them in network approaches, for which a number of theoretical methods offer powerful ways to tackle systemic problems. At least two different ways of approaching these challenges may be considered: direct modeling methods and approaches based on inverse methods. In this thesis, we use both to study three different problems occurring at three different biological scales. In the first part, we deal mainly with the very early stages of tissue development in plants. We propose a model aimed at understanding which features drive the spontaneous collective behaviour in space and time of PINs, the transporters that pump the phytohormone auxin out of cells. In the second part, we focus instead on the structural properties of proteins. In particular, we ask how conservation of protein function across different organisms constrains the evolution of protein sequences and their diversity. We propose a new method to extract the sequence positions most relevant for protein function. Finally, in the third part, we study the intracellular molecular networks that implement auxin signaling in plants. In this context, using extensions of a previously published model, we examine how network structure affects network function. The comparison of different network topologies provides insights into the role of different modules, and of a negative feedback loop in particular. The dynamical response function that we introduce allows us to characterize the systemic properties of auxin signaling when external stimuli are applied.
|
18 |
Statistical modeling of protein sequences beyond structural prediction : high dimensional inference with correlated data / Modélisation statistique des séquences de protéines au-delà de la prédiction structurelle : inférence en haute dimension avec des données corrélées
Coucke, Alice 10 October 2016
Over the last decades, genomic databases have grown exponentially in size thanks to the constant progress of modern DNA sequencing. A large variety of statistical tools have been developed at the interface between bioinformatics, machine learning, and statistical physics to extract information from these ever-increasing datasets. In the specific context of protein sequence data, several approaches have recently been introduced by statistical physicists, such as direct-coupling analysis, a global statistical inference method based on the maximum-entropy principle that has proven extremely effective at predicting the three-dimensional structure of proteins from purely statistical considerations. In this dissertation, we review the relevant inference methods and, encouraged by their success, discuss their extension to other challenging fields, such as sequence folding prediction and homology detection. In contrast to residue-residue contact prediction, which relies on intrinsically topological information about the network of interactions, these fields require global energetic considerations and therefore a more quantitative and detailed model. Through an extensive study of both artificial and biological data, we provide a better interpretation of the central inferred parameters, which have so far been poorly understood, especially in the limited-sampling regime. Finally, we present a new and more precise procedure for the inference of generative models, which leads to further improvements on real, finitely sampled data.
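For orientation, direct-coupling analysis is usually formulated as a maximum-entropy (Potts-type) model over aligned sequences $(a_1,\dots,a_L)$. The form below is the standard one from the literature, not a formula quoted from this abstract; the symbols $h_i$, $J_{ij}$, $Z$ and the empirical frequencies $f_i$, $f_{ij}$ are notation assumed here.

```latex
P(a_1,\dots,a_L) = \frac{1}{Z}\exp\Big(\sum_{i=1}^{L} h_i(a_i) + \sum_{1\le i<j\le L} J_{ij}(a_i,a_j)\Big),
\qquad
\sum_{\{a_k\}_{k\neq i}} P = f_i(a_i), \quad
\sum_{\{a_k\}_{k\neq i,j}} P = f_{ij}(a_i,a_j)
```

The fields $h_i$ and couplings $J_{ij}$ are chosen so that the model reproduces the single-site and pairwise frequencies of the alignment. Contact prediction only needs a ranking of the coupling strengths, whereas the applications discussed in the abstract use the full exponent as a sequence-level score, which is why they demand more accurately inferred parameters.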
|
19 |
Computational Studies on Structures and Functions of Single and Multi-domain Proteins
Mehrotra, Prachi January 2017
Proteins are essential for the growth, survival and maintenance of the cell. Understanding the functional roles of proteins helps to decipher the workings of the macromolecular assemblies and cellular machinery of living organisms. A thorough investigation of the link between the sequence, structure and function of proteins helps in building a comprehensive understanding of complex biological systems. Proteins are composed of single or multiple domains. Analysis of proteins encoded in diverse genomes shows the ubiquitous nature of multi-domain proteins. Though the majority of eukaryotic proteins are multi-domain in nature, 3-D structures are known for only a small proportion of multi-domain proteins because such proteins are difficult to crystallize. While the functions of individual domains are generally studied extensively, the complex interplay of domain functions is not well understood for most multi-domain proteins. The paucity of structural and functional data affects our understanding of the evolution of structure and function of multi-domain proteins.
The broad objective of this thesis is to achieve an enhanced understanding of the structure and function of protein domains through computational analysis of sequence and structural data. The first few chapters pay special attention to multi-domain proteins. Chapters 2 and 3 classify multi-domain proteins using an alignment-free sequence comparison method. Chapters 4 and 5 describe studies on the organization, interactions and interdependence of domain-domain interactions in multi-domain proteins with respect to the sequential separation between domains and the N- to C-terminal domain order. The functional and structural repertoire of organisms can be comprehensively studied and compared using functional and structural domain annotations. Chapters 6, 7 and 8 present proteome-wide structure and function comparisons of various pathogenic and non-pathogenic microorganisms. These comparisons help to identify proteins implicated in the virulence of the pathogen and thus to predict putative targets for disease treatment and prevention.
Chapter 1 introduces the main subject area of this thesis. Starting with a description of protein structure and function, it details the four levels of hierarchical organization of protein structure, along with the databases that document protein sequences and structures. The classification of protein domains, regarded as units of function, structure and evolution, is described. The usefulness of classifying proteins at the domain level is highlighted in terms of the enhanced understanding it provides of protein structure and function and of evolutionary relatedness. The structure, function and evolution of multi-domain proteins are also outlined in Chapter 1.
Chapter 2 aims to achieve a biologically meaningful classification scheme for multi-domain protein sequences. The overall function of a multi-domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence-based methods use only domain-level information to classify proteins, which does not take into account the contributions of accessory domains and linker regions to the overall function of a multi-domain protein. An alignment-free protein sequence comparison tool, CLAP (CLAssification of Proteins), previously developed in this laboratory, was assessed and improved when the author joined the group. CLAP was developed especially to handle multi-domain protein sequences without requiring domain boundaries or the sequential order of domains (domain architecture) to be defined.
The working principle of CLAP involves an all-against-all comparison of windows of 5-residue sequence patterns between two protein sequences. The sequences compared can be full-length, comprising all the domains of the two proteins. The result of these comparisons is represented as the Local Matching Scores (LMS) between the protein sequences (nslab.iisc.ernet.in/clap/). It has previously been shown that CLAP runs ~7 times faster than other protein sequence comparison methods that employ sequence alignment. In Chapter 2, CLAP-based classification is carried out on two test datasets of proteins containing (i) the tyrosine phosphatase domain family and (ii) the SH3-domain family. The former dataset comprises both single- and multi-domain proteins, some of which contain repeats of the tyrosine phosphatase domain. The latter dataset consists only of multi-domain proteins with one copy of the SH3 domain. At the domain level, the CLAP-based classification scheme resulted in a clustering similar to that obtained from an alignment-based method, ClustalW. CLAP-based clusters obtained for the full-length datasets were shown to comprise proteins with similar functions and domain architectures. Hence, a protein classification scheme that is independent of domain definitions and requires only full-length amino acid sequences as input is shown to work efficiently.
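The idea of comparing all 5-residue windows of two sequences without aligning them can be sketched in a few lines. This is only a toy illustration of the principle: the abstract does not define how CLAP computes its Local Matching Scores, so the normalisation used here (the fraction of windows with an identical counterpart in the other sequence) and the function names are assumptions, and the sequences in the usage note are made up.

```python
def windows(seq, k=5):
    """All overlapping k-residue windows of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def window_match_score(seq_a, seq_b, k=5):
    """Toy alignment-free similarity in [0, 1].

    Counts the windows of each sequence that occur identically anywhere in
    the other sequence, then divides by the total number of windows.
    This scoring is an assumption, not CLAP's actual LMS definition.
    """
    wa, wb = windows(seq_a, k), windows(seq_b, k)
    if not wa or not wb:
        return 0.0  # at least one sequence is shorter than k
    set_a, set_b = set(wa), set(wb)
    hits_a = sum(1 for w in wa if w in set_b)
    hits_b = sum(1 for w in wb if w in set_a)
    return (hits_a + hits_b) / (len(wa) + len(wb))
```

Because windows are matched regardless of position, two made-up "domains" `A = "MKTAYIAKQR"` and `B = "GSHMLEDPVW"` give `window_match_score(A + B, B + A) == 0.75`: only the windows spanning the junction fail to match, so swapping the domain order costs little. The same construction also shows why identical 5-residue matches become rare for divergent sequences, since a single substitution removes up to five matching windows.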
Chapter 3 explores the limitations of CLAP in large-scale protein sequence comparisons. The potential advantages of full-length protein sequence classification, combined with the availability of the alignment-free sequence comparison tool, CLAP, motivated the conceptualization of full-length sequence classification of the entire protein repertoire. Before undertaking this mammoth task, the working of CLAP was tested on a large dataset of 239,461 protein sequences. Chapter 3 discusses the technical details of computation, storage and retrieval of CLAP scores for a large dataset in a feasible timeframe. CLAP scores were examined for protein pairs with the same domain architecture, and ~22% of these showed CLAP similarity scores of 0. This led to an investigation of the sensitivity of CLAP with respect to sequence divergence. Several test datasets of proteins belonging to the same SCOP fold were constructed, and CLAP-based classification of these proteins was examined at the inter- and intra-SCOP family levels. CLAP was successful in efficiently clustering evolutionarily related proteins (defined as proteins within the same SCOP superfamily) if their sequence identity was >35%. At lower sequence identities, CLAP fails to recognize any evolutionary relatedness. Another test dataset consisting of two-domain proteins with domain order swapped was constructed. Domain order swap refers to domain architectures of type AB and BA, consisting of domains A and B. A condition that the sequence identities of homologous domains be greater than 35% was imposed. CLAP could effectively cluster together proteins of the same domain architectures in this case. Thus, the sequence identity threshold of 35% at the domain level improves the accuracy of CLAP. The analysis also showed that for highly divergent sequences, the expectation of a 5-residue pattern match was likely too stringent a criterion.
A modification of the 5-residue identical pattern match criterion, allowing similar residues and gaps within matched patterns, may therefore be required to enable CLAP-based clustering of remotely related protein sequences. Overall, this study highlights the limitations of CLAP with respect to large-scale analysis and its sensitivity to sequence divergence.
Chapters 4 and 5 discuss the computational analysis of inter-domain interactions with respect to sequential distance and domain order. Knowledge of domain composition and 3-D structures of individual domains in a multi-domain protein may not be sufficient to predict the tertiary structure of the multi-domain protein. Substantial information about the nature of domain-domain interfaces helps in prediction of the tertiary as well as the quaternary structure of a protein. Therefore, Chapter 4 explores the possible relationship between the sequential distance separating two domains in a multi-domain protein and the extent of their interaction. With increasing sequential separation between any two domains, the extent of inter-domain interactions showed a gradual decrease. The trend was more apparent when sequential separation between domains was measured in terms of the number of intervening domains. Irrespective of the linker length, extensive interactions were seen more often between contiguous domains than between non-contiguous domains. Contiguous domains show a broader interface area range and a lower proportion of non-interacting domains (interface area: 0 Å² to ~4400 Å², 2.3% non-interacting domains) than non-contiguous domains (interface area: 0 Å² to ~2000 Å², 34.7% non-interacting domains).
Additionally, as inter-protein interactions are mediated through constituent domains, rules of protein-protein interactions were applied to domain-domain interactions. Tight binding between domains is denoted as putative permanent domain-domain interactions and domains that may dissociate and associate with relatively weak interactions to regulate functional activity are denoted as putative transient domain-domain interactions. An interface area threshold of 600 Å² was utilized as a binary classifier to distinguish between putative permanent and putative transient domain-domain interactions. Therefore, the state of interaction of a domain pair is defined as either putative permanent or putative transient interaction. Contiguous domains showed a predominance of putative permanent nature of inter-domain interface, whereas non-contiguous domains showed a prevalence of putative transient interfaces. The state of interaction of various SCOP superfamily pairs was studied across different proteins in the dataset. SCOP superfamily pairs mostly showed a conserved state of interaction, i.e. either putative permanent or putative transient in all their occurrences across different proteins. Thus, it is noted that contiguous domains interact extensively more often than non-contiguous domains and specific superfamily pairs tend to interact in a conserved manner. In conclusion, a combination of interface area and other inter-domain properties along with experimental validation will help strengthen the binary classification scheme of putative permanent and transient domain-domain interactions.
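The interface-area classifier described above can be sketched in a few lines. Only the 600 Å² cutoff comes from the abstract. Whether the cutoff is inclusive, the separate "non-interacting" label for a zero interface, and all function names are assumptions of this sketch.

```python
PERMANENT_CUTOFF_A2 = 600.0  # interface-area threshold quoted in the abstract (in Å²)


def interaction_state(interface_area):
    """Label a domain pair from its interface area in Å².

    Assumptions: the cutoff is treated as inclusive, and a zero
    interface is reported separately as non-interacting.
    """
    if interface_area < 0:
        raise ValueError("interface area cannot be negative")
    if interface_area == 0:
        return "non-interacting"
    if interface_area >= PERMANENT_CUTOFF_A2:
        return "putative permanent"
    return "putative transient"


def has_conserved_state(interface_areas):
    """True if every occurrence of a superfamily pair gets the same label."""
    states = {interaction_state(area) for area in interface_areas}
    return len(states) == 1
```

A superfamily pair observed with interface areas of 800 Å² and 1200 Å² in two proteins would count as conserved under this sketch, whereas 800 Å² and 150 Å² would not.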
Chapter 5 provides a structural analysis of domain pairs occurring in different sequential domain orders in multi-domain proteins. The function and regulation of a multi-domain protein are predominantly determined by the domain-domain interactions. These in turn are influenced by the sequential order of domains in a protein. With domains defined using evolutionary and structural relatedness (SCOP superfamily), their conservation of structure and function was studied across domain order reversal. A domain order reversal indicates different sequential orders of the concerned domains, which may be identified in proteins of the same or different domain compositions. Domain order reversal of domains A and B is seen in a protein pair with the domain architectures xAxBx and xBxAx, where x indicates 0 or more domains. A total of 161 pairs of domain order reversals were identified in 77 pairs of PDB entries. For most of the comparisons between proteins with different domain composition and architecture, large differences in the relative spatial orientation of domains were observed. Although preservation of the state of interaction was observed for ~75% of the comparisons, none of the inter-domain interfaces of domains in different order displayed high interface similarity.
These domain order reversals in multi-domain proteins involve only 15 SCOP superfamilies. The majority of the superfamilies undergoing order reversal function either as transporters or as regulatory domains, and very few are enzymes.
A higher proportion of domain order reversals were observed in domains separated by 0 or 1 domains than those separated by more than 1 domain. A thorough analysis of various structural features of domains undergoing order reversal indicates that only one order of domains is strongly preferred over all possible orders. This may be due to either evolutionary selection of one of the orders and its conservation throughout generations, or the fact that domain order reversals rarely conserve the interface between the domains.
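The xAxBx versus xBxAx definition of a domain order reversal can be made concrete with a short sketch. It assumes each protein is given as an N- to C-terminal list of superfamily labels. The labels and function names are hypothetical, and this is not the procedure used in the thesis.

```python
from itertools import combinations


def ordered_pairs(architecture):
    """All (first, second) pairs of different domains in which `first`
    lies N-terminal to `second`, with any number of domains in between."""
    return {(a, b) for a, b in combinations(architecture, 2) if a != b}


def order_reversals(arch_1, arch_2):
    """Domain pairs found in order A...B in one protein and B...A in the other."""
    pairs_1 = ordered_pairs(arch_1)
    pairs_2 = ordered_pairs(arch_2)
    return {tuple(sorted(pair)) for pair in pairs_1 if (pair[1], pair[0]) in pairs_2}


# Hypothetical architectures: A and B are reversed, X is an intervening domain.
example = order_reversals(["A", "X", "B"], ["B", "A"])
```

Here `example` holds the single pair `("A", "B")`, since A precedes B in the first architecture and follows it in the second.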
Further studies (Chapters 6 to 8) utilize the available computational techniques for structural and functional annotation of proteins encoded in a few bacterial genomes. Based on these annotations, proteome-wide structure and function comparisons were performed between two sets of pathogenic and non-pathogenic bacteria. The first study compares the pathogenic Mycobacterium tuberculosis to the closely related organism Mycobacterium smegmatis which is non-pathogenic. The second study primarily identified biologically feasible host-pathogen interactions between the human host and the pathogen Leptospira interrogans and also compared leptospiral-host interactions of the pathogenic Leptospira interrogans and of the saprophytic Leptospira biflexa with the human host.
Chapter 6 describes the function and structure annotation of proteins encoded in the genome of M. smegmatis MC2-155. M. smegmatis is a widely used model organism for understanding the pathophysiology of M. tuberculosis, the primary causative agent of tuberculosis in humans. The two species share several features, such as a similar cell-wall architecture, the ability to oxidise carbon monoxide aerobically and a large number of homologues. These features render M. smegmatis particularly useful in identifying critical cellular pathways of M. tuberculosis that could be targeted to inhibit its growth in the human host. In spite of the similarities between M. smegmatis and M. tuberculosis, there are stark differences between the two due to their different niches and lifestyles. While there are innumerable studies reporting the structure, function and interaction properties of M. tuberculosis proteins, there is a lack of high-quality annotation of M. smegmatis proteins. This makes understanding the biology of M. smegmatis extremely important for assessing its suitability as a model organism for M. tuberculosis.
With the implementation of available sequence and structural profile-based search procedures, functional and structural characterization could be achieved for ~92% of the M. smegmatis proteome. Structural and functional domain definitions were obtained for a total of 5695 of 6717 proteins in M. smegmatis. Residue coverage >70% was achieved for 4567 proteins, which constitute ~68% of the proteome. Regions longer than 30 residues with no domain assigned were assessed for their potential to be associated with a domain. Of the 1022 proteins with no recognizable domains, putative structural and functional information was inferred for 328 proteins by the use of distant relationship detection and fold recognition methods. Although 916 of the 1022 proteins with no recognizable domains were found to be specific to the M. smegmatis species, 98 of these are specific to its MC2-155 strain. Of the 1828 M. smegmatis proteins classified as conserved hypothetical proteins, 1038 proteins were successfully characterized. A total of 33 Domains of Unknown Function (DUFs) occurring in M. smegmatis could be associated with structural domains.
A high representation of the tetR and GntR families of transcription regulators was noted in the functional repertoire of the M. smegmatis proteome. As M. smegmatis is a soil-dwelling bacterium, transcriptional regulators are crucial for helping it to adapt to and survive environmental stress. Similarly, the ABC transporter and MFS domain families are highly represented in the M. smegmatis proteome. These are important in enabling the bacteria to take up carbohydrates from diverse environmental sources. A lower number of virulence proteins were identified in M. smegmatis, which is consistent with its non-pathogenicity. Thus, a detailed functional and structural annotation of the M. smegmatis proteome was achieved in Chapter 6.
Chapter 7 delineates the similarities and differences in the structure and function of proteins encoded in the genomes of the pathogenic M. tuberculosis and the non-pathogenic M. smegmatis. The protocol employed in Chapter 6 to achieve the proteome-wide structure and function annotation of M. smegmatis was also applied to the M. tuberculosis proteome in Chapter 7. The number of proteins encoded by the genome of M. smegmatis strain MC2-155 (6717 proteins) is higher than that of M. tuberculosis strain H37Rv (4018 proteins). A total of 2720 high-confidence orthologues sharing ≥30% sequence identity were identified in M. tuberculosis with respect to M. smegmatis. Based on the orthologue information, specific functional clusters, essential proteins, metabolic pathways, transporters and toxin-antitoxin systems of M. tuberculosis were inspected for conservation in M. smegmatis.
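The ≥30% sequence identity criterion can be illustrated with a minimal sketch. The abstract does not say how identity was computed or how orthologues were detected, so the input format (two pre-aligned sequences of equal length with `-` for gaps), the choice of gap-free columns as the denominator and the function names are all assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the gap-free columns of a pairwise alignment.

    Assumption: identity is reported relative to the aligned, gap-free
    columns; other denominators are also in common use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def passes_identity_cutoff(aligned_a, aligned_b, cutoff=30.0):
    """Apply the >=30% identity filter mentioned in the abstract."""
    return percent_identity(aligned_a, aligned_b) >= cutoff
```

An identity filter of this kind is only one ingredient of orthologue detection; it does not by itself distinguish orthologues from paralogues.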
Among the several categories analysed, no orthologues could be detected in M. smegmatis for 53 metabolic pathways, 44 membrane transporter proteins belonging to the secondary transporter and ATP-dependent transporter classes, 73 toxin-antitoxin systems, 23 M. tuberculosis-specific targets, 10 broad-spectrum targets and 34 targets implicated in persistence of M. tuberculosis. Several of the MFS superfamily transporters act as drug efflux pumps and are hence associated with drug resistance in M. tuberculosis. The relative abundances of MFS and ABC superfamily transporters are higher in M. smegmatis than in M. tuberculosis. As these transporters are involved in carbohydrate uptake, their higher representation in M. smegmatis than in M. tuberculosis highlights the limited ability of M. tuberculosis to assimilate diverse carbon sources. In the case of porins, MspA-like and OmpA-like porins are selectively present in either M. smegmatis or M. tuberculosis. These differences help to identify protein clusters for which M. smegmatis may not be the best model organism to study M. tuberculosis proteins.
At the domain level, the ATP-binding domain of ABC transporters, the tetracycline transcriptional regulator (tetR) domain family, the major facilitator superfamily (MFS) domain family, the AMP-binding domain family and the enoyl-CoA hydrolase domain family are highly represented in both the M. smegmatis and M. tuberculosis proteomes. These domains play an essential role in the carbohydrate uptake systems and drug-efflux pumps among other diverse functions in mycobacteria. There are several differentially represented domain families in M. tuberculosis and M. smegmatis. For example, the pentapeptide-repeat, PE, PPE and PIN domains, although abundantly present in M. tuberculosis, are very rare in M. smegmatis. Therefore, such uniquely or differentially represented functional and structural domains in M. tuberculosis as compared to M. smegmatis may be linked to pathogenicity or adaptation of M. tuberculosis in the host. Hence, major differences between M. tuberculosis and M. smegmatis were identified, not only in terms of domain populations but also in terms of domain combinations. Thus, Chapter 7 highlights the similarities and differences between the M. smegmatis and M. tuberculosis proteomes in terms of structure and function. These differences indicate where M. smegmatis can be used selectively as a model organism to study M. tuberculosis.
In Chapter 8, computational tools have been employed to predict biologically feasible host-pathogen interactions between the human host and the pathogen Leptospira interrogans. Sensitive profile-based search procedures were used to specifically identify practical drug targets in the genome of Leptospira interrogans, the causative agent of the globally widespread zoonotic disease leptospirosis. Traditionally, the genus Leptospira is classified into two species complexes: the pathogenic L. interrogans and the non-pathogenic saprophyte L. biflexa. The pathogen gains entry into the human host through direct or indirect contact with fluids of infected animals. Several ambiguities exist in the understanding of L. interrogans pathogenesis.
An integration of multiple computational approaches guided by experimentally derived protein-protein interactions was utilized for recognition of host-pathogen protein-protein interactions. The initial step involved the identification of similarities of host and L. interrogans proteins with crystal structures of experimentally known transient protein-protein complexes. Further, conservation of interfacial nature was used to obtain high confidence predictions for putative host-pathogen protein-protein interactions. These predictions were subjected to further selection based on subcellular localization of proteins of the human host and L. interrogans, and tissue-specific expression profiles of the host proteins. A total of 49 protein-protein interactions mediated by 24 L. interrogans proteins and 17 host proteins were identified and these may be subjected to further experimental investigations to assess their in vivo relevance.
The functional relevance of similarities and differences between the pathogenic and non-pathogenic leptospires in terms of interactions with the host has also been explored. For this, protein-protein interactions between the human host and the non-pathogenic saprophyte L. biflexa were also predicted. Nearly 39 leptospiral-host interactions were recognized to be similar across both the pathogen and the saprophyte in the context of processes that influence the host. The overlapping leptospiral-host interactions of L. interrogans and L. biflexa proteins with the human host proteins are primarily associated with establishment of entry into the human host. These include adhesion of the leptospiral proteins to host cells, survival in the host environment, such as iron acquisition, and binding to components of the extracellular matrix and plasma. The disjoint sets of leptospiral-host interactions are species-specific interactions and, more importantly, are indicative of the establishment of infection by L. interrogans in the human host and of immune clearance of L. biflexa by the human host. With respect to L. interrogans, these specific interactions include interference with the blood coagulation cascade and dissemination to target organs by means of disruption of cell junction assembly. On the other hand, species-specific interactions of L. biflexa proteins include those with components of the host immune system.
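The split into overlapping and disjoint sets of leptospiral-host interactions amounts to plain set operations, sketched below. How interactions from the two species were matched as "similar" is not stated in the abstract; representing each interaction as a (host protein, leptospiral protein family) pair is an assumption, and every identifier in the example is a made-up placeholder.

```python
def compare_interactomes(pathogen, saprophyte):
    """Split two sets of predicted host interactions into shared and
    species-specific subsets."""
    return {
        "shared": pathogen & saprophyte,
        "pathogen_only": pathogen - saprophyte,
        "saprophyte_only": saprophyte - pathogen,
    }


# Placeholder data: (host protein, leptospiral protein family) pairs.
interrogans = {("host_P1", "fam_X"), ("host_P2", "fam_Y"), ("host_P3", "fam_Z")}
biflexa = {("host_P1", "fam_X"), ("host_P4", "fam_W")}

result = compare_interactomes(interrogans, biflexa)
```

With these placeholder sets, one interaction is shared, two are specific to the pathogen and one is specific to the saprophyte.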
In spite of the limited availability of experimental evidence, these predictions help in identifying functionally relevant interactions between host and pathogen by integrating multiple lines of evidence. Thus, inferences from computational prediction of host-pathogen interactions act as guidelines for experimental studies investigating the in vivo relevance of these predicted protein-protein interactions. This will further help in developing effective measures for treatment and disease prevention.
In summary, Chapters 2 and 3 describe the implementation, advantages and limitations of the alignment-free full-length sequence comparison method, CLAP. Chapters 4 and 5 are dedicated to understanding the domain-domain interactions in multi-domain protein sequences and structures. In Chapters 6, 7 and 8, the computational analyses of the mycobacterial and leptospiral species led to an enhanced understanding of the functional repertoire of these bacteria. These studies were undertaken by utilizing the biological sequence data available in public databases and by implementing powerful homology-detection techniques.
The supplemental data associated with the chapters is provided on a compact disc attached to this thesis.
|
20 |
Classifiers for Discrimination of Significant Protein Residues and Protein-Protein Interaction Using Concepts of Information Theory and Machine Learning / Klassifikatoren zur Unterscheidung von Signifikanten Protein Residuen und Protein-Protein Interaktion unter Verwendung von Informationstheorie und maschinellem Lernen
Asper, Roman Yorick, 26 October 2011
No description available.
|