1 |
From protein sequence to structural instability and disease
Wang, Lixiao January 2010
A great challenge in bioinformatics is to accurately predict protein structure and function from amino acid sequence, including annotating protein domains, identifying disordered regions and detecting protein stability changes resulting from amino acid mutations. The combination of bioinformatics, genomics and proteomics has become essential for investigating the biological, cellular and molecular aspects of disease, and can therefore contribute greatly to understanding protein structures and facilitating drug discovery. In this thesis, a PREDICTOR is presented, which consists of three machine learning methods applied to three different but related structural bioinformatics tasks: using profile Hidden Markov Models (HMMs) to identify remote sequence homologues on the basis of protein domains; predicting order and disorder in proteins using Conditional Random Fields (CRFs); and applying Support Vector Machines (SVMs) to detect protein stability changes due to single mutations. To facilitate structural instability and disease studies, these methods are implemented in three web servers: FISH, OnD-CRF and ProSMS, respectively. For FISH, most of the work presented in the thesis focuses on the design and construction of the web server. The server is based on a collection of structure-anchored hidden Markov models (saHMMs), which are used to identify structural similarity at the protein domain level. For the order and disorder prediction server, OnD-CRF, I implemented two schemes to alleviate the imbalance between ordered and disordered amino acids in the training dataset. One prunes the protein sequences in order to obtain a balanced training dataset. The other tries to find the optimal p-value cut-off for discriminating between ordered and disordered amino acids. Both schemes enhance the sensitivity of detecting disordered amino acids in proteins.
In addition, the output from the OnD-CRF web server can also be used to identify flexible regions and to predict the effect of mutations on protein stability. For ProSMS, we propose, after careful evaluation with different methods, a homology-clustered model and a non-clustered model for a three-state classification of protein stability changes due to single amino acid mutations. Results for the non-clustered model reveal that the sequence-only prediction accuracy is comparable to the accuracy based on protein 3D structure information. In the case of the clustered model, however, the prediction accuracy is significantly improved when protein tertiary structure information, in the form of local environmental conditions, is included. Comparing the prediction accuracies of the two models indicates that predicting mutation-induced stability changes for proteins that are not homologous is still a challenging task. Benchmarking results show that, as stand-alone programs, these predictors can be comparable or superior to previously established predictors. Combined into a program package, these mutually complementary predictors will facilitate the understanding of structural instability and disease from protein sequence.
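The cut-off scheme described for OnD-CRF can be illustrated with a minimal sketch. It assumes that the cut-off is applied to per-residue disorder scores between 0 and 1 and that the "optimal" value is the one maximizing balanced accuracy; the abstract states neither, and the function name `best_cutoff` and the toy data are hypothetical.

```python
def best_cutoff(scores, labels, candidates=None):
    """Scan cut-offs and return (cutoff, balanced_accuracy) for the best one.

    scores: per-residue disorder scores in [0, 1] (assumed input format).
    labels: 1 for disordered residues, 0 for ordered residues.
    Balanced accuracy is an assumed criterion, not the thesis's stated one.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    n_dis = sum(labels)
    n_ord = len(labels) - n_dis
    if n_dis == 0 or n_ord == 0:
        raise ValueError("both classes must be present")
    best_c, best_score = None, -1.0
    for c in candidates:
        # A residue is called disordered when its score reaches the cut-off.
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < c)
        score = 0.5 * (tp / n_dis + tn / n_ord)
        if score > best_score:  # strict '>' keeps the lowest tied cut-off
            best_c, best_score = c, score
    return best_c, best_score


# Toy data: disordered residues are the minority class.
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.90, 0.05]
labels = [0, 0, 0, 0, 1, 0, 1, 0]
cutoff, bal_acc = best_cutoff(scores, labels)
```

On this toy data a fixed cut-off of 0.5 would miss the disordered residue scored 0.40, whereas the scanned cut-off recovers it at the cost of one false positive, which is the kind of sensitivity gain the abstract refers to.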
|
2 |
Structural Information and Hidden Markov Models for Biological Sequence Analysis
Tångrot, Jeanette January 2008
Bioinformatics is a fast-developing field, which makes use of computational methods to analyse and structure biological data. An important branch of bioinformatics is structure and function prediction of proteins, which is often based on finding relationships to already characterized proteins. It is known that two proteins with very similar sequences also share the same 3D structure. However, there are many proteins with similar structures that have no clear sequence similarity, which makes it difficult to find these relationships. In this thesis, two methods for annotating protein domains are presented, one aiming at assigning the correct domain family or families to a protein sequence, and the other aiming at fold recognition. Both methods use hidden Markov models (HMMs) to find related proteins, and both exploit the fact that structure is more conserved than sequence, but in two different ways. Most of the research presented in the thesis focuses on the structure-anchored HMMs, saHMMs. For each domain family, an saHMM is constructed from a multiple structure alignment of carefully selected representative domains, the saHMM-members. These saHMM-members are collected in the so-called "midnight ASTRAL set", and are chosen so that all saHMM-members within the same family have mutual sequence identities below a threshold of about 20%. In order to construct the midnight ASTRAL set and the saHMMs, a pipeline of software tools was developed. The saHMMs are shown to detect the correct family relationships with very high accuracy, and to perform better than the standard tool Pfam in assigning the correct domain families to new domain sequences. We also introduce the FI-score, which is used to measure the performance of the saHMMs in order to select the optimal model for each domain family. The saHMMs are made available for searching through the FISH server, and can be used for assigning family relationships to protein sequences.
The other approach presented in the thesis is secondary structure HMMs (ssHMMs). These HMMs are designed to use both the sequence and the predicted secondary structure of a query protein when scoring it against the model. A rigorous benchmark is used, which shows that HMMs made from multiple sequences result in better fold recognition than those based on single sequences. Adding secondary structure information to the HMMs further improves fold recognition, both when using true and when using predicted secondary structures for the query sequence. / Bioinformatics is a field in which methods from computer science and statistics are used to analyse and structure biological data. An important area within bioinformatics tries to predict the three-dimensional structure and function of a protein from its amino acid sequence and/or its similarities to other, already characterized, proteins. It is known that two proteins with similar amino acid sequences also have similar three-dimensional structures. However, the fact that two proteins have similar structures does not necessarily mean that their sequences are similar, which can make it difficult to find structural similarities from a protein's amino acid sequence. This thesis describes two methods for finding similarities between proteins: one focuses on determining which family of protein domains, with known 3D structure, a given sequence belongs to, while the other tries to predict a protein's fold, i.e. to give a rough picture of the protein's structure. Both methods use so-called hidden Markov models (HMMs), a statistical method that can be used, among other things, to describe protein families. With an HMM, one can predict whether a given protein sequence belongs to the family that the model represents. Both methods also use structural information to increase the models' ability to recognize related sequences, but in different ways.
Most of the work in the thesis concerns structure-anchored HMMs (saHMMs). The saHMMs are built from structure-based sequence alignments, which are generated from how the protein domains can be superimposed in space rather than from which amino acids make up their sequences. In each protein family, only a special, representative selection of domains is used. These are chosen so that, when the sequences are compared pairwise, no pair within the family has a sequence identity higher than about 20%. This selection is made to obtain as wide a spread as possible among the sequences within the family. A suite of software has been developed to select representatives for each family and then build saHMMs based on them. The saHMMs turn out to find the right family for a high proportion of the tested sequences, with almost no errors. They are also better than the widely used method Pfam at finding the right family for entirely new protein sequences. The saHMMs are available through the FISH server, which anyone can use via the Internet to find which family a protein of interest may belong to. The other method presented in the thesis is secondary structure HMMs, ssHMMs, which are built from ordinary multiple sequence alignments, but also from information about which secondary structures the protein sequences in the family have. When a protein sequence is compared with the ssHMM, a prediction of its secondary structure is used, and the computed probability that the sequence belongs to the family is based on both the amino acid sequence and the secondary structure. A comparison shows that HMMs based on multiple sequences are better than those based on only one sequence at finding the right fold for a protein sequence. The HMMs become even better if the secondary structure is also taken into account, both when the true secondary structure is used and when a theoretically predicted one is used. / Jeanette Hargbo.
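The selection rule for family representatives (no pair within a family above roughly 20% sequence identity) can be sketched as a simple greedy filter. This is only an illustration: the abstract does not describe the actual selection procedure or how identity is computed, so the functions `percent_identity` and `select_representatives`, the gap convention, and the toy sequences are all assumptions, and a greedy pass does not guarantee the largest possible set.

```python
def percent_identity(a, b):
    """Percent identity over alignment columns where both sequences have a residue.

    Assumes a and b are rows of the same alignment, with '-' marking gaps.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def select_representatives(aligned, threshold=20.0):
    """Greedily keep domains whose identity to every kept domain is below threshold.

    aligned: dict mapping a domain name to its aligned sequence.
    The result depends on input order (dicts preserve insertion order).
    """
    kept = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy family with hypothetical domain names and sequences.
family = {
    "d1": "ACDEFGHIKL",
    "d2": "ACDEFGHIKM",  # 90% identical to d1, so it is dropped
    "d3": "WYWYWYWYWY",  # 0% identical to d1, so it is kept
    "d4": "W-DYWYWYWA",  # about 11% to d1 but about 78% to d3, so it is dropped
}
members = select_representatives(family)
```

Because the filter is order-dependent, a real pipeline would presumably rank candidates first (for example by structure quality) before applying the identity threshold; that ranking is not specified in the abstract.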
|
3 |
Structural and Functional Characterization of O-Antigen Translocation and Polymerization in Pseudomonas aeruginosa PAO1
Islam, Salim Timo 07 June 2013
Heteropolymeric O antigen (O-Ag)-capped lipopolysaccharide is the principal constituent of the Gram-negative bacterial cell surface. It is assembled via the integral inner membrane (IM) Wzx/Wzy-dependent pathway. In Pseudomonas aeruginosa, Wzx translocates lipid-linked anionic O-Ag subunits from the cytoplasmic to the periplasmic leaflets of the IM, where Wzy polymerizes the subunits to lengths regulated by Wzz1/2. The Wzx and Wzy IM topologies were mapped using random C-terminal-truncation fusions to PhoALacZα, which displays PhoA/LacZ activity dependent upon its subcellular localization. Twelve transmembrane segments (TMS) containing charged residues were identified for Wzx. Fourteen TMS, two sizeable cytoplasmic loops (CL), and two large periplasmic loops (PL3 and PL5 of comparable size) were characterized for Wzy.
Despite Wzy PL3–PL5 sequence homology, these loops were distinguished by respective cationic and anionic charge properties. Site-directed mutagenesis identified functionally-essential Arg residues in both loops. These results led to the proposition of a “catch-and-release” mechanism for Wzy function. The abovementioned Arg residues and intra-Wzy PL3–PL5 sequence homology were conserved among phylogenetically diverse Wzy homologues, indicating widespread potential for the proposed mechanism. Unexpectedly, Wzy CL6 mutations disrupted Wzz1-mediated regulation of shorter O-Ag chains, providing the first evidence for direct Wzy–Wzz interaction.
Mutagenesis studies identified functionally-important charged and aromatic TMS residues localized to either the interior vestibule or TMS bundles in a 3D homology model constructed for Wzx. Substrate-binding or energy-coupling roles were proposed for these residues, respectively. The Wzx interior was found to be cationic, consistent with translocation of anionic O-Ag subunits. To test these hypotheses, Wzx was overexpressed, purified, and reconstituted in proteoliposomes loaded with I−. Common transport coupling ions were introduced to “open” the protein and allow detection of I− flux via reconstituted Wzx. Extraliposomal changes in H+ induced I− flux, while Na+ addition had no effect, suggesting H+-dependent Wzx gating. Putative energy-coupling residue mutants demonstrated defective H+-dependent halide flux. Wzx also mediated H+ uptake as detected through fluorescence shifts from proteoliposomes loaded with pH-sensitive dye. Consequently, Wzx was proposed to function via H+-coupled antiport. In summary, this research has contributed structural and functional knowledge leading to novel mechanistic understandings for O-Ag biosynthesis in bacteria. / Bookmarks within the document have been provided for ease of access to a particular section in the body of the thesis. Each entry in the Table of Contents, List of Tables, and List of Figures has been "linked" to its respective position and as such can be clicked for direct access to the entry. Similarly, each in-text Figure or Table reference has been "linked" to its respective figure/table for direct access to the entry. / 1) Canadian Institutes of Health Research (CIHR) Frederick Banting and Charles Best Canada Graduate Scholarship doctoral award, 2) CIHR Michael Smith Foreign Study Award, 3) Cystic Fibrosis Canada (CFC) doctoral studentship, 4) University of Guelph Dean's Tri-Council Scholarship, 5) Ontario Graduate Scholarship in Science and Technology, 6) Operating grants to Dr. Joseph S. Lam from CIHR (MOP-14687) and CFC
|