Global ETD Search

1	Protein Fold Recognition Using Adaboost Learning Strategy Su, Yijing 29 September 2010 Protein structure prediction is one of the most important and difficult problems in computational molecular biology. Unlike sequence-only comparison, protein fold recognition based on machine learning algorithms attempts to detect similarities between protein structures that might not be accompanied by any significant sequence similarity. It takes advantage of structural and physical properties beyond sequence information. In this thesis, we present a novel classifier for protein fold recognition, using the AdaBoost algorithm hybridized with a k-nearest-neighbor classifier. The experimental framework consists of two tasks: (i) carry out cross-validation within the training dataset, and (ii) test on an unseen validation dataset, in which 90% of the proteins have less than 25% sequence identity with the training samples. Our method achieves a 64.7% success rate in classifying the independent validation dataset into 27 types of protein folds. Our experiments on protein fold recognition demonstrate the merit of this approach: an AdaBoost strategy coupled with weak learning classifiers leads to improved and robust performance, with 64.7% accuracy versus 61.2% accuracy in the published literature using identical sample sets, feature representation, and class labels. Adaboost Recognition Learning Strategy Protein Fold
2	Mechanisms and Consequences of Evolving a New Protein Fold Kumirov, Vlad K. January 2016 The ability of mutations to change the fold of a protein provides evolutionary pathways to new structures. To study hypothetical pathways for protein fold evolution, we designed intermediate sequences between Xfaso1 and Pfl6, two homologous Cro proteins that have 40% sequence identity but adopt all-α and α+β folds, respectively. The designed hybrid sequences XPH1 and XPH2 have 70% sequence identity to each other. XPH1 is more similar in sequence to Xfaso1 (86% sequence identity) while XPH2 is more similar to Pfl6 (80% sequence identity). NMR solution ensembles show that XPH1 and XPH2 have structures intermediate between Xfaso1 and Pfl6. Specifically, XPH1 loses α-helices 5 and 6 of Xfaso1 and incorporates a small amount of β-sheet structure; XPH2 preserves most of the β-sheet of Pfl6 but gains a structure comparable to helix 6 of Xfaso1. These findings illustrate that the sequence space between two natural protein folds may encode a range of topologies, which may allow a protein to change its fold extensively through gradual, multistep mechanisms. Evolving a new fold may have consequences, such as a strained conformation. Here we show that Pfl6 represents an early, strained form of the α+β Cro fold resulting from an ancestral remnant of the all-α Cro proteins retained after the fold switch. This nascent fold can be stabilized through deletion mutations in evolution, which can relieve the strain but may also negatively affect DNA-binding function. Compensatory mutations that increase dimerization appear to offset these effects to maintain function. These findings suggest that new folds can undergo mutational editing through evolution, which may occur in parallel pathways with slightly different outcomes. protein fold transformation protein nmr spectroscopy protein structure Chemistry protein evolution
3	Structural Information and Hidden Markov Models for Biological Sequence Analysis Tångrot, Jeanette January 2008 Bioinformatics is a fast-developing field, which makes use of computational methods to analyse and structure biological data. An important branch of bioinformatics is structure and function prediction of proteins, which is often based on finding relationships to already characterized proteins. It is known that two proteins with very similar sequences also share the same 3D structure. However, there are many proteins with similar structures that have no clear sequence similarity, which makes it difficult to find these relationships. In this thesis, two methods for annotating protein domains are presented, one aiming at assigning the correct domain family or families to a protein sequence, and the other aiming at fold recognition. Both methods use hidden Markov models (HMMs) to find related proteins, and both exploit the fact that structure is more conserved than sequence, but in two different ways. Most of the research presented in the thesis focuses on the structure-anchored HMMs, saHMMs. For each domain family, an saHMM is constructed from a multiple structure alignment of carefully selected representative domains, the saHMM-members. These saHMM-members are collected in the so-called "midnight ASTRAL set", and are chosen so that all saHMM-members within the same family have mutual sequence identities below a threshold of about 20%. In order to construct the midnight ASTRAL set and the saHMMs, a pipeline of software tools was developed. The saHMMs are shown to detect the correct family relationships with very high accuracy, and to perform better than the standard tool Pfam in assigning the correct domain families to new domain sequences. We also introduce the FI-score, which is used to measure the performance of the saHMMs in order to select the optimal model for each domain family. 
The saHMMs are made available for searching through the FISH server, and can be used for assigning family relationships to protein sequences. The other approach presented in the thesis is secondary structure HMMs (ssHMMs). These HMMs are designed to use both the sequence and the predicted secondary structure of a query protein when scoring it against the model. A rigorous benchmark is used, which shows that HMMs made from multiple sequences result in better fold recognition than those based on single sequences. Adding secondary structure information to the HMMs further improves fold recognition, both when using true and when using predicted secondary structures for the query sequence. / Bioinformatics is a field in which methods from computer science and statistics are used to analyse and structure biological data. An important area within bioinformatics tries to predict the three-dimensional structure and function of a protein from its amino acid sequence and/or its similarities to other, already characterized, proteins. It is known that two proteins with similar amino acid sequences also have similar three-dimensional structures. That two proteins have similar structures does not, however, necessarily mean that their sequences are alike, which can make it difficult to find structural similarities from a protein's amino acid sequence. This thesis describes two methods for finding similarities between proteins: one focuses on determining which family of protein domains, with known 3D structure, a given sequence belongs to, while the other tries to predict a protein's fold, i.e. to give a rough picture of the protein's structure. Both methods use hidden Markov models (HMMs), a statistical method that can be used, among other things, to describe protein families. With the help of an HMM, one can predict whether a given protein sequence belongs to the family the model represents. 
Both methods also use structural information to increase the models' ability to recognize related sequences, but in different ways. Most of the work in the thesis concerns structure-anchored HMMs (saHMMs). To build the saHMMs, structure-based sequence alignments are used, which are generated from how the protein domains can be superimposed in space rather than from which amino acids make up their sequences. In each protein family, only a special, representative selection of domains is used. These are chosen so that, when the sequences are compared pairwise, no pair within the family has a sequence identity higher than about 20%. This selection is made to obtain as wide a spread as possible among the sequences within the family. A suite of software has been developed to select representatives for each family and then build saHMMs based on them. It turns out that the saHMMs can find the right family for a high proportion of the tested sequences, with almost no errors. They are also better than the widely used method Pfam at finding the right family for entirely new protein sequences. The saHMMs are available through the FISH server, which anyone can use via the Internet to find which family a protein of interest may belong to. The other method presented in the thesis is secondary structure HMMs (ssHMMs), which are built from ordinary multiple sequence alignments, but also from information about which secondary structures the protein sequences in the family have. When a protein sequence is compared with the ssHMM, a prediction of its secondary structure is used, and the calculated probability that the sequence belongs to the family is based both on the amino acid sequence and on the secondary structure. A comparison shows that HMMs based on multiple sequences are better than those based on only a single sequence at finding the right fold for a protein sequence. 
The HMMs become even better if the secondary structure is also taken into account, both when the true secondary structure is used and when a theoretically predicted one is used. / Jeanette Hargbo. HMM structure alignment protein structure secondary structure remote homologue annotation domain family protein family protein superfamily protein fold recognition Bioinformatics
4	Approaches based on tree-structures classifiers to protein fold prediction Mauricio-Sanchez, David, de Andrade Lopes, Alneu, higuihara Juarez Pedro Nelson 08 1900 The full text of this work is not available in the UPC Academic Repository (Repositorio Académico UPC) because of restrictions imposed by the publisher where it was published. / Protein fold recognition is an important task in biology. Different machine learning methods, such as multiclass classifiers, one-vs-all, and ensemble nested dichotomies, have been applied to this task, and in most cases multiclass approaches were used. In this paper, we compare classifiers organized in tree structures to classify folds. We used a benchmark dataset containing 125 features to predict folds, comparing different supervised methods and achieving 54% accuracy. An approach based on a tree structure of classifiers obtained better results than a hierarchical approach. / Peer reviewed Learning systems Protein folding Proteins Trees (mathematics) Benchmark datasets Hierarchical approach Machine learning methods Multi-class classifier Nested dichotomies Protein fold recognition Supervised methods Tree structures
