Global ETD Search

11	Analysis Of Structural And Functional Types Of Protein-Protein Interactions Nambudiry Rekha, * 02 1900 (has links) (PDF) No description available. Protein-Protein Interactions Protein Structures Protein Interfaces Human Chorionic Gonadotropin (hCG) Protein Kinases RNA Polymerase II Protein Complexes Antibodies Epitope Sites Protein Kinase CK2 Tethered Protein Domain Interacting Proteins Biochemistry
12	From protein sequence to structural instability and disease Wang, Lixiao January 2010 (has links) A great challenge in bioinformatics is to accurately predict protein structure and function from its amino acid sequence, including annotation of protein domains, identification of protein disordered regions and detecting protein stability changes resulting from amino acid mutations. The combination of bioinformatics, genomics and proteomics becomes essential for the investigation of biological, cellular and molecular aspects of disease, and therefore can greatly contribute to the understanding of protein structures and facilitating drug discovery. In this thesis, a PREDICTOR, which consists of three machine learning methods applied to three different but related structure bioinformatics tasks, is presented: using profile Hidden Markov Models (HMMs) to identify remote sequence homologues, on the basis of protein domains; predicting order and disorder in proteins using Conditional Random Fields (CRFs); applying Support Vector Machines (SVMs) to detect protein stability changes due to single mutation. To facilitate structural instability and disease studies, these methods are implemented in three web servers: FISH, OnD-CRF and ProSMS, respectively. For FISH, most of the work presented in the thesis focuses on the design and construction of the web-server. The server is based on a collection of structure-anchored hidden Markov models (saHMM), which are used to identify structural similarity on the protein domain level. For the order and disorder prediction server, OnD-CRF, I implemented two schemes to alleviate the imbalance problem between ordered and disordered amino acids in the training dataset. One uses pruning of the protein sequence in order to obtain a balanced training dataset. The other tries to find the optimal p-value cut-off for discriminating between ordered and disordered amino acids. Both these schemes enhance the sensitivity of detecting disordered amino acids in proteins. In addition, the output from the OnD-CRF web server can also be used to identify flexible regions, as well as predicting the effect of mutations on protein stability. For ProSMS, we propose, after careful evaluation with different methods, a clustered by homology and a non-clustered model for a three-state classification of protein stability changes due to single amino acid mutations. Results for the non-clustered model reveal that the sequence-only based prediction accuracy is comparable to the accuracy based on protein 3D structure information. In the case of the clustered model, however, the prediction accuracy is significantly improved when protein tertiary structure information, in form of local environmental conditions, is included. Comparing the prediction accuracies for the two models indicates that the prediction of mutation stability of proteins that are not homologous is still a challenging task. Benchmarking results show that, as stand-alone programs, these predictors can be comparable or superior to previously established predictors. Combined into a program package, these mutually complementary predictors will facilitate the understanding of structural instability and disease from protein sequence. protein domain remote homologue protein function point mutation protein family protein stability HMMs CRFs SVMs
13	Kelch-like ECH-associated protein 1 (KEAP1) differentially regulates nuclear factor erythroid-2–related factors 1 and 2 (NRF1 and NRF2) Tian, Wang, de la Vega, Montserrat Rojo, Schmidlin, Cody J., Ooi, Aikseng, Zhang, Donna D. 09 February 2018 (has links) Nuclear factor erythroid-2-related factor 1 (NRF1) and NRF2 are essential for maintaining redox homeostasis and coordinating cellular stress responses. They are highly homologous transcription factors that regulate the expression of genes bearing antioxidant-response elements (AREs). Genetic ablation of NRF1 or NRF2 results in vastly different phenotypic outcomes, implying that they play different roles and may be differentially regulated. Kelch-like ECH-associated protein 1 (KEAP1) is the main negative regulator of NRF2 and mediates ubiquitylation and degradation of NRF2 through its NRF2-ECH homology-like domain 2 (Neh2). Here, we report that KEAP1 binds to the Neh2-like (Neh2L) domain of NRF1 and stabilizes it. Consistently, NRF1 is more stable in KEAP1(+/+) than in KEAP1(-/-) isogenic cell lines, whereas NRF2 is dramatically stabilized in KEAP1(-/-) cells. Replacing NRF1's Neh2L domain with NRF2's Neh2 domain renders NRF1 sensitive to KEAP1-mediated degradation, indicating that the amino acids between the DLG and ETGE motifs, not just the motifs themselves, are essential for KEAP1-mediated degradation. Systematic site-directed mutagenesis identified the core amino acid residues required for KEAP1-mediated degradation and further indicated that the DLG and ETGE motifs with correct spacing are insufficient as a KEAP1 degron. Our results offer critical insights into our understanding of the differential regulation of NRF1 and NRF2 by KEAP1 and their different physiological roles. protein degradation protein domain protein motif protein stability KEAP1 NRF1
14	HOST RESTRICTION FACTORS IN THE REPLICATION OF TOMBUSVIRUSES: FROM RNA HELICASES TO NUCLEOCYTOPLASMIC SHUTTLING Wu, Cheng-Yu 01 January 2019 (has links) Positive-stranded (+)RNA viruses replicate inside cells and depend on many cellular factors to complete their infection cycle. In the meanwhile, (+)RNA viruses face the host innate immunity, such as cell-intrinsic restriction factors that could block virus replication. Firstly, I have established that the plant DDX17-like RH30 DEAD-box helicase conducts strong inhibitory function on tombusvirus replication when expressed in plants and yeast surrogate host. This study demonstrates that RH30 blocks the assembly of viral replicase complex, the activation of RNA-dependent RNA polymerase function of p92pol and viral RNA template recruitment. In addition, the features rendering the abundant plant DEAD-box helicases either antiviral or pro-viral functions in tombusvirus replication are intriguing. I found the reversion of the antiviral function of DDX17-like RH30 DEAD-box helicase and the coopted pro-viral DDX3-like RH20 helicase due to deletion of unique N-terminal domains. The discovery of the sequence plasticity of DEAD-box helicases that can alter recognition of different cis-acting elements in the viral genome illustrates the evolutionary potential of RNA helicases in the arms race between viruses and their hosts. Moreover, I discovered that Xpo1 possesses an anti-viral function and exports previously characterized cell-intrinsic restriction factors (CIRFs) from the nucleus to the replication compartment of tombusviruses. Altogether, in my PhD studies, I found plant RH30 DEAD-box helicase is a potent host restriction factor inhibiting multiple steps of the tombusvirus replication. In addition, I provided the evidence supporting that the Nterminal domain determines the functions of antiviral DDX17-like RH30 DEAD-box helicase and pro-viral DDX3-like RH20 DEAD-box helicase in tombusvirus replication. Moreover, I discovered the emerging significance of the Xpo1-dependent nuclear export pathway in tombusvirus replication. Positive strand RNA virus DEAD-box RNA helicase protein domain function Nucleocytoplasmic shuttling Xpo1 Agriculture Immunology and Infectious Disease Molecular Biology Virology
15	Protein Structure Data Management System Wang, Yanchao 03 August 2007 (has links) With advancement in the development of the new laboratory instruments and experimental techniques, the protein data has an explosive increasing rate. Therefore how to efficiently store, retrieve and modify protein data is becoming a challenging issue that most biological scientists have to face and solve. Traditional data models such as relational database lack of support for complex data types, which is a big issue for protein data application. Hence many scientists switch to the object-oriented databases since object-oriented nature of life science data perfectly matches the architecture of object-oriented databases, but there are still a lot of problems that need to be solved in order to apply OODB methodologies to manage protein data. One major problem is that the general-purpose OODBs do not have any built-in data types for biological research and built-in biological domain-specific functional operations. In this dissertation, we present an application system with built-in data types and built-in biological domain-specific functional operations that extends the Object-Oriented Database (OODB) system by adding domain-specific additional layers Protein-QL, Protein Algebra Architecture and Protein-OODB above OODB to manage protein structure data. This system is composed of three parts: 1) Client API to provide easy usage for different users. 2) Middleware including Protein-QL, Protein Algebra Architecture and Protein-OODB is designed to implement protein domain specific query language and optimize the complex queries, also it capsulates the details of the implementation such that users can easily understand and master Protein-QL. 3) Data Storage is used to store our protein data. This system is for protein domain, but it can be easily extended into other biological domains to build a bio-OODBMS. In this system, protein, primary, secondary, and tertiary structures are defined as internal data types to simplify the queries in Protein-QL such that the domain scientists can easily master the query language and formulate data requests, and EyeDB is used as the underlying OODB to communicate with Protein-OODB. In addition, protein data is usually stored as PDB format and PDB format is old, ambiguous, and inadequate, therefore, PDB data curation will be discussed in detail in the dissertation. Protein-QL Protein-OODB Domain Specific Query Language Protein Algebra Protein Ontology Protein Wrapper Computer Sciences
16	Predikce proteinových domén / Protein Domains Prediction Valenta, Martin January 2013 (has links) The work is focused on the area of the proteins and their domains. It also briefly describes gathering methods of the protein´s structure at the various levels of the hierarchy. This is followed by examining of existing tools for protein´s domains prediction and databases consisting of domain´s information. In the next part of the work selected representatives of prediction methods are introduced. These methods work with the information about the internal structure of the molecule or the amino acid sequence. The appropriate chapter outlines applied procedure of domains´ boundaries prediction. The prediction is derived from the primary structure of the protein, using a neural network The implemented procedure and its possibility of further development in the related thesis are introduced at the conclusion of this work.
17	Rôle des protéines à domaines GGDEF et/ou EAL chez Legionella pneumophila / Role of the GGDEF/EAL proteins of L. pneumophila Levet-Paulo, Mélanie 11 July 2011 (has links) Legionella pneumophila est un pathogène intracellulaire, agent de la Légionellose, et dont le réservoir dans l’environnement est constitué d’amibes aquatiques comme Acanthamoeba castellani. Mes objectifs de thèse étaient l’identification de mécanismes moléculaires contrôlant la virulence et la multi-résistance chez L. pneumophila, et en particulier l’exploration du rôle des protéines « GGDEF/EAL ». Les domaines GGDEF et EAL sont retrouvés dans des enzymes permettant respectivement de synthétiser (diguanylate cyclase, DGC) ou dégrader (phosphodiestérase, PDE) le di-GMP cyclique, un second messager spécifique des bactéries, qui participe au contrôle de fonctions clés comme la virulence ou la mobilité. L. pneumophila Lens possède 22 gènes codant des protéines GGDEF/EAL, et dont la plupart sont exprimés lorsqu’elle est dans sa phase virulente. L’activité enzymatique des 22 protéines « GGDEF/EAL » a été analysée in vitro : sur 10 protéines purifiées, 6 sont des DGC, dont 2 présentes une double activité DGC/PDE. L’inactivation de 5 gènes des 22 gènes et la surexpression de 2 autres entrainent une baisse de la virulence vis-à-vis d’A. castellanii. De plus, nous avons montré que l’activité DGC d’au moins 2 de ces protéines est requise lors du cycle infectieux. Enfin, nous avons décrit un système à deux composants original comprenant l’histidine kinase (HK) Lpl0330, capable de s’autophosphoryler sur un nouveau domaine HisKA, retrouvé dans 64 autres HK potentielles, et Lpl0329, le premier régulateur de réponse à double activité enzymatique caractérisé, dont la phosphorylation conduit à moduler le taux de di‐GMPc en favorisant une de ses deux activités (Levet-Paulo et al., 2011). / Legionella pneumophila is an intracellular pathogen found in aquatic environments where it replicates in protozoan hosts. My objectives were to identify molecular mechanisms that control virulence and multidrug resistance in L. pneumophila, and especially to explore the role of proteins “GGDEF/EAL” in virulence. GGDEF and EAL domains are found in enzymes able to synthesize (diguanylate cyclase, DGC) or degrade (phosphodiesterase, PDE) c-di-GMP, respectively. C-di-GMP is a bacterial second messenger which plays a key role in regulating main functions including motility, and virulence. L. pneumophila Lens contains 22 genes encoding GGDEF/EAL proteins and most of them are expressed simultaneously with genes encoding virulence factors. The enzymatic activities of the 22 GGDEF/EAL proteins of L. pneumophila Lens were assayed in vitro. Among the 10 proteins purified, 6 showed a DGC activity and 2 contained both activities. The role of the GGDEF/EAL proteins of L. pneumophila Lens on virulence was investigated. Inactivation of 5 genes and overexpression of 2 other genes led to a significant decrease in virulence. Moreover, DGC activity of at least two of these proteins is required for bacterial virulence. Finally, an original two-component system was identified comprising Lpl0330, a histidine kinase able to autophosphorylate on a new HisKA domain, and Lpl 0329, a protein with dual in vitro DGC/PDE activity. Phosphorylation of Lpl0329 led to a decrease in its DGC activity only, giving the first example of a bifunctional enzyme which modulates synthesis and turnover of c-di-GMP in response to phosphorylation (Levet-Paulo et al., 2011). Legionella pneumophila Virulence Protéines à domaines GGDEF/EAL Système à deux composants Régulation Boîte HGN Phosphorylation Activité enzymatique Legionella pneumophila Virulence Protein domain GGDEF/EAL Two-component system Control HGN box Phosphorylation Enzyme activity 579.33
18	Computational Studies on Structures and Functions of Single and Multi-domain Proteins Mehrotra, Prachi January 2017 (has links) (PDF) Proteins are essential for the growth, survival and maintenance of the cell. Understanding the functional roles of proteins helps to decipher the working of macromolecular assemblies and cellular machinery of living organisms. A thorough investigation of the link between sequence, structure and function of proteins, helps in building a comprehensive understanding of the complex biological systems. Proteins have been observed to be composed of single and multiple domains. Analysis of proteins encoded in diverse genomes shows the ubiquitous nature of multi-domain proteins. Though the majority of eukaryotic proteins are multi-domain in nature, 3-D structures of only a small proportion of multi-domain proteins are known due to difficulties in crystallizing such proteins. While functions of individual domains are generally extensively studied, the complex interplay of functions of domains is not well understood for most multi-domain proteins. Paucity of structural and functional data, affects our understanding of the evolution of structure and function of multi-domain proteins. The broad objective of this thesis is to achieve an enhanced understanding of structure and function of protein domains by computational analysis of sequence and structural data. Special attention is paid in the first few chapters of this thesis on the multi-domain proteins. Classification of multi-domain proteins by implementation of an alignment-free sequence comparison method has been achieved in Chapters 2 and 3. Studies on organization, interactions and interdependence of domain-domain interactions in multi-domain proteins with respect to sequential separation between domains and N to C-terminal domain order have been described in Chapters 4 and 5. The functional and structural repertoire of organisms can be comprehensively studied and compared using functional and structural domain annotations. Chapter 6, 7 and 8 represent the proteome-wide structure and function comparisons of various pathogenic and non-pathogenic microorganisms. These comparisons help in identifying proteins implicated in virulence of the pathogen and thus predict putative targets for disease treatment and prevention. Chapter 1 forms an introduction to the main subject area of this thesis. Starting with describing protein structure and function, details of the four levels of hierarchical organization of protein structure have been provided, along with the databases that document protein sequences and structures. Classification of protein domains considered as the realm of function, structure and evolution has been described. The usefulness of classification of proteins at the domain level has been highlighted in terms of providing an enhanced understanding of protein structure and function and also their evolutionary relatedness. The details of structure, function and evolution of multi-domain proteins have also been outlined in chapter 1. ! Chapter 2 aims to achieve a biologically meaningful classification scheme for multi-domain protein sequences. The overall function of a multi-domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence-based methods utilize only the domain-level information to classify proteins. This does not take into account the contributions of accessory domains and linker regions towards the overall function of a multi-domain protein. An alignment-free protein sequence comparison tool, CLAP (CLAssification of Proteins) previously developed in this laboratory, was assessed and improved when the author joined the group. CLAP was developed especially to handle multi-domain protein sequences without a requirement of defining domain boundaries and sequential order of domains (domain architecture). ! The working principle of CLAP involves comparison of all against all windows of 5-residue sequence patterns between two protein sequences. The sequences compared could be full-length comprising of all the domains in the two proteins. This compilation of comparison is represented as the Local Matching Scores (LMS) between protein sequences (nslab.iisc.ernet.in/clap/). It has been previously shown that the execution time of CLAP is ~7 times faster than other protein sequence comparison methods that employ alignment of sequences. In Chapter 2, CLAP-based classification has been carried out on two test datasets of proteins containing (i) Tyrosine phosphatase domain family and (ii) SH3-domain family. The former dataset comprises both single and multi-domain proteins that sometimes consist of domain repeats of the tyrosine phosphatase domain. The latter dataset consists only of multi-domain proteins with one copy of the SH3-domain. At the domain-level CLAP-based classification scheme resulted in a clustering similar to that obtained from an alignment-based method, ClustalW. CLAP-based clusters obtained for full-length datasets were shown to comprise of proteins with similar functions and domain architectures. Hence, a protein classification scheme is shown to work efficiently that is independent of domain definitions and requires only the full-length amino acid sequences as input.! Chapter 3 explores the limitations of CLAP in large-scale protein sequence comparisons. The potential advantages of full-length protein sequence classification, combined with the availability of the alignment-free sequence comparison tool, CLAP, motivated the conceptualization of full-length sequence classification of the entire protein repertoire. Before undertaking this mammoth task, working of CLAP was tested for a large dataset of 239,461 protein sequences. Chapter 3 discusses the technical details of computation, storage and retrieval of CLAP scores for a large dataset in a feasible timeframe. CLAP scores were examined for protein pairs of same domain architecture and ~22% of these showed 0 CLAP similarity scores. This led to investigation of the sensitivity of CLAP with respect to sequence divergence. Several test datasets of proteins belonging to the same SCOP fold were constructed and CLAP-based classification of these proteins was examined at inter and intra-SCOP family level. CLAP was successful in efficiently clustering evolutionary related proteins (defined as proteins within the same SCOP superfamily) if their sequence identity >35%. At lower sequence identities, CLAP fails to recognize any evolutionary relatedness. Another test dataset consisting of two-domain proteins with domain order swapped was constructed. Domain order swap refers to domain architectures of type AB and BA, consisting of domains A and B. A condition that the sequence identities of homologous domains were greater than 35% was imposed. CLAP could effectively cluster together proteins of the same domain architectures in this case. Thus, the sequence identity threshold of 35% at the domain-level improves the accuracy of CLAP. The analysis also showed that for highly divergent sequences, the expectation of 5-residue pattern match was likely a stringent criterion. Thus, a modification in the 5-residue identical pattern match criterion, by considering even similar residue and gaps within matched patterns may be required to effectuate CLAP-based clustering of remotely related protein sequences. Thus, this study highlights the limitations of CLAP with respect to large-scale analysis and its sensitivity to sequence divergence. ! Chapters 4 and 5 discuss the computational analysis of inter-domain interactions with respect to sequential distance and domain order. Knowledge of domain composition and 3-D structures of individual domains in a multi-domain protein may not be sufficient to predict the tertiary structure of the multi-domain protein. Substantial information about the nature of domain-domain interfaces helps in prediction of the tertiary as well as the quaternary structure of a protein. Therefore, chapter 4 explores the possible relationship between the sequential distance separating two domains in a multi-domain protein and the extent of their interaction. With increasing sequential separation between any two domains, the extent of inter-domain interactions showed a gradual decrease. The trend was more apparent when sequential separation between domains is measured in terms of number of intervening domains. Irrespective of the linker length, extensive interactions were seen more often between contiguous domains than between non-contiguous domains. Contiguous domains show a broader interface area and lower proportion of non-interacting domains (interface area: 0 Å2 to - 4400 Å2, 2.3% non-interacting domains) than non-contiguous domains (interface area: 0 Å2 to - 2000 Å2, 34.7% non-interacting domains). Additionally, as inter-protein interactions are mediated through constituent domains, rules of protein-protein interactions were applied to domain-domain interactions. Tight binding between domains is denoted as putative permanent domain-domain interactions and domains that may dissociate and associate with relatively weak interactions to regulate functional activity are denoted as putative transient domain-domain interactions. An interface area threshold of 600 Å2 was utilized as a binary classifier to distinguish between putative permanent and putative transient domain-domain interactions. Therefore, the state of interaction of a domain pair is defined as either putative permanent or putative transient interaction. Contiguous domains showed a predominance of putative permanent nature of inter-domain interface, whereas non-contiguous domains showed a prevalence of putative transient interfaces. The state of interaction of various SCOP superfamily pairs was studied across different proteins in the dataset. SCOP superfamily pairs mostly showed a conserved state of interaction, i.e. either putative permanent or putative transient in all their occurrences across different proteins. Thus, it is noted that contiguous domains interact extensively more often than non-contiguous domains and specific superfamily pairs tend to interact in a conserved manner. In conclusion, a combination of interface area and other inter-domain properties along with experimental validation will help strengthen the binary classification scheme of putative permanent and transient domain-domain interactions.! Chapter 5 provides structural analysis of domain pairs occurring in different sequential domain orders in mutli-domain proteins. The function and regulation of a multi-domain protein is predominantly determined by the domain-domain interactions. These in turn are influenced by the sequential order of domains in a protein. With domains defined using evolutionary and structural relatedness (SCOP superfamily), their conservation of structure and function was studied across domain order reversal. A domain order reversal indicates different sequential orders of the concerned domains, which may be identified in proteins of same or different domain compositions. Domain order reversals of domains A and B can be indicated in protein pair consisting of the domain architectures xAxBx and xBxAx, where x indicates 0 or more domains. A total of 161 pairs of domain order reversals were identified in 77 pairs of PDB entries. For most of the comparisons between proteins with different domain composition and architecture, large differences in the relative spatial orientation of domains were observed. Although preservation of state of interaction was observed for ~75% of the comparisons, none of the inter-domain interfaces of domains in different order displayed high interface similarity. These domain order reversals in multi-domain proteins are contributed by a limited number of 15 SCOP superfamilies. Majority of the superfamilies undergoing order reversal either function as transporters or regulatory domains and very few are enzymes. A higher proportion of domain order reversals were observed in domains separated by 0 or 1 domains than those separated by more than 1 domain. A thorough analysis of various structural features of domains undergoing order reversal indicates that only one order of domains is strongly preferred over all possible orders. This may be due to either evolutionary selection of one of the orders and its conservation throughout generations, or the fact that domain order reversals rarely conserve the interface between the domains. Further studies (Chapters 6 to 8) utilize the available computational techniques for structural and functional annotation of proteins encoded in a few bacterial genomes. Based on these annotations, proteome-wide structure and function comparisons were performed between two sets of pathogenic and non-pathogenic bacteria. The first study compares the pathogenic Mycobacterium tuberculosis to the closely related organism Mycobacterium smegmatis which is non-pathogenic. The second study primarily identified biologically feasible host-pathogen interactions between the human host and the pathogen Leptospira interrogans and also compared leptospiral-host interactions of the pathogenic Leptospira interrogans and of the saprophytic Leptospira biflexa with the human host. Chapter 6 describes the function and structure annotation of proteins encoded in the genome of M. smegmatis MC2-155. M. smegmatis is a widely used model organism for understanding the pathophysiology of M. tuberculosis, the primary causative agent of tuberculosis in humans. M. smegmatis and M. tuberculosis species of the mycobacterial genus share several features like a similar cell-wall architecture, the ability to oxidise carbon monoxide aerobically and share a huge number of homologues. These features render M. smegmatis particularly useful in identifying critical cellular pathways of M. tuberculosis to inhibit its growth in the human host. In spite of the similarities between M. smegmatis and M. tuberculosis, there are stark differences between the two due to their diverse niche and lifestyle. While there are innumerable studies reporting the structure, function and interaction properties of M. tuberculosis proteins, there is a lack of high quality annotation of M. smegmatis proteins. This makes the understanding of the biology of M. smegmatis extremely important for investigating its competence as a good model organism for M. tuberculosis. With the implementation of available sequence and structural profile-based search procedures, functional and structural characterization could be achieved for ~92% of the M. smegmatis proteome. Structural and functional domain definitions were obtained for a total of 5695 of 6717 proteins in M. smegmatis. Residue coverage >70% was achieved for 4567 proteins, which constitute ~68% of the proteome. Domain unassigned regions more than 30 residues were assessed for their potential to be associated to a domain. For 1022 proteins with no recognizable domains, putative structural and functional information was inferred for 328 proteins by the use of distance relationship detection and fold recognition methods. Although 916 sequences of 1022 proteins with no recognizable domains were found to be specific to M. smegmatis species, 98 of these are specific to its MC2-155 strain. Of the 1828 M. smegmatis proteins classified as conserved hypothetical proteins, 1038 proteins were successfully characterized. A total of 33 Domains of Unknown Function (DUFs) occurring in M. smegmatis could be associated to structural domains. A high representation of the tetR and GntR family of transcription regulators was noted in the functional repertoire of M. smegmatis proteome. As M. smegmatis is a soil-dwelling bacterium, transcriptional regulators are crucial for helping it to adapt and survive the environmental stress. Similarly, the ABC transporter and MFS domain families are highly represented in the M. smegmatis proteome. These are important in enabling the bacteria to uptake carbohydrate from diverse environmental sources. A lower number of virulent proteins were identified in M. smegmatis, which justifies its non-pathogenicity. Thus, a detailed functional and structural annotation of the M. smegmatis proteome was achieved in Chapter 6. Chapter 7 delineates the similarities and difference in the structure and function of proteins encoded in the genomes of the pathogenic M. tuberculosis and the non-pathogenic M. smegmatis. The protocol employed in Chapter 6 to achieve the proteome-wide structure and function annotation of M. smegmatis was also applied to M. tuberculosis proteome in Chapter 7. The number of proteins encoded by the genome of M. smegmatis strain MC2-155 (6717 proteins) is comparatively higher than that in M. tuberculosis strain H37Rv (4018 proteins). A total of 2720 high confidence orthologues sharing ≥30% sequence identity were identified in M. tuberculosis with respect to M. smegmatis. Based on the orthologue information, specific functional clusters, essential proteins, metabolic pathways, transporters and toxin-antitoxin systems of M. tuberculosis were inspected for conservation in M. smegmatis. Among the several categories analysed, 53 metabolic pathways, 44 membrane transporter proteins belonging to secondary transporters and ATP-dependent transporter classes, 73 toxin-antitoxin systems, 23 M. tuberculosis-specific targets, 10 broad-spectrum targets and 34 targets implicated in persistence of M. tuberculosis could not detect any orthologues in M. smegmatis. Several of the MFS superfamily transporters act as drug efflux pumps and are hence associated with drug resistance in M. tuberculosis. The relative abundances of MFS and ABC superfamily transporters are higher in M. smegmatis than in M. tuberculosis. As these transporters are involved in carbohydrate uptake, their higher representation in M. smegmatis than in M. tuberculosis highlights the lack of proficiency of M. tuberculosis to assimilate diverse carbon sources. In the case of porins, MspA-like and OmpA-like porins are selectively present in either M. smegmatis or M. tuberculosis. These differences help to elucidate protein clusters for which M. smegmatis may not be the best model organism to study M. tuberculosis proteins.! At the domain-level, ATP-binding domain of ABC transporters, tetracycline transcriptional regulator (tetR) domain family, major facilitator superfamily (MFS) domain family, AMP-binding domain family and enoyl-CoA hydrolase domain family are highly represented in both M. smegmatis and M. tuberculosis proteomes. These domains play an essential role in the carbohydrate uptake systems and drug-efflux pumps among other diverse functions in mycobacteria. There are several differentially represented domain families in M. tuberculosis and M. smegmatis. For example, the pentapeptide-repeat domain, PE, PPE and PIN domains although abundantly present in M. tuberculosis, are very rare in M. smegmatis. Therefore, such uniquely or differentially represented functional and structural domains in M. tuberculosis as compared to M. smegmatis may be linked to pathogenicity or adaptation of M. tuberculosis in the host. Hence, major differences between M. tuberculosis and M. smegmatis were identified, not only in terms of domain populations but also in terms of domain combinations. Thus, Chapter 7 highlights the similarities and differences between M. smegmatis and M. tuberculosis proteomes in terms of structure and function. These differences provide an understanding of selective utilization of M. smegmatis as a model organism to study M. tuberculosis. ! In Chapter 8, computational tools have been employed to predict biologically feasible host-pathogen interactions between the human host and the pathogenic, Leptospira interrogans. Sensitive profile-based search procedures were used to specifically identify practical drug targets in the genome of Leptospira interrogans, the causative agent of the globally widespread zoonotic disease, Leptospirosis. Traditionally, the genus Leptospira is classified into two species complex- the pathogenic L. interrogans and the non-pathogenic saprophyte L. biflexa. The pathogen gains entry into the human host through direct or indirect contact with fluids of infected animals. Several ambiguities exist in the understanding of L. interrogans pathogenesis. An integration of multiple computational approaches guided by experimentally derived protein-protein interactions, was utilized for recognition of host-pathogen protein-protein interactions. The initial step involved the identification of similarities of host and L. interrogans proteins with crystal structures of experimentally known transient protein-protein complexes. Further, conservation of interfacial nature was used to obtain high confidence predictions for putative host-pathogen protein-protein interactions. These predictions were subjected to further selection based on subcellular localization of proteins of the human host and L. interrogans, and tissue-specific expression profiles of the host proteins. A total of 49 protein-protein interactions mediated by 24 L. interrogans proteins and 17 host proteins were identified and these may be subjected to further experimental investigations to assess their in vivo relevance. The functional relevance of similarities and differences between the pathogenic and non-pathogenic leptospires in terms of interactions with the host has also been explored. For this, protein-protein interactions across human host and the non-pathogenic saprophyte L. biflexa were also predicted. Nearly 39 leptospiral-host interactions were recognized to be similar across both the pathogen and saprophyte in the context of processes that influence the host. The overlapping leptospiral-host interactions of L. interrogans and L. biflexa proteins with the human host proteins are primarily associated with establishment of its entry into the human host. These include adhesion of the leptospiral proteins to host cells, survival in host environment such as iron acquisition and binding to components of extracellular matrix and plasma. The disjoint sets of leptospiral-host interactions are species-specific interactions, more importantly indicative of the establishment of infection by L. interrogans in the human host and immune clearance of L. biflexa by the human host. With respect to L. interrogans, these specific interactions include interference with blood coagulation cascade and dissemination to target organs by means of disruption of cell junction assembly. On the other hand, species-specific interactions of L. biflexa proteins include those with components of host immune system. ! In spite of the limited availability of experimental evidence, these help in identifying functionally relevant interactions between host and pathogen by integrating multiple lines of evidence. Thus, inferences from computational prediction of host-pathogen interactions act as guidelines for experimental studies investigating the in vivo relevance of these predicted protein-protein interactions. This will further help in developing effective measures for treatment and disease prevention. In summary, Chapters 2 and 3 describe the implementation, advantages and limitations of the alignment-free full-length sequence comparison method, CLAP. Chapter 4 and 5 are dedicated to understand the domain-domain interactions in multi-domain protein sequences and structures. In Chapters 6, 7 and 8 the computational analyses of the mycobacterial species and leptospiral species helped in an enhanced understanding of the functional repertoire of these bacteria. These studies were undertaken by utilizing the biological sequence data available in public databases and implementation of powerful homology-detection techniques. The supplemental data associated with the chapters is provided in a compact disc attached with this thesis.! Proteins - Building Blocks Protein Sequences Protein Domain Hidden Markov Models (HMM) Multi-domain Proteins Mycobacterium smegmatis MC2-155 Mycobacterium tuberculosis Proteomes Leptospira Interrogans Leptospira Biflexa Proteomes Leptospira Biflexa Genomes Mycobacterium tuberculosis H37Rv Mathematics

Search results