  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Nekovalentní interakce tryptofanu ve struktuře proteinu / Non-covalent interactions of tryptophan in protein structure

Sokol, Albert January 2019
A thorough knowledge of non-covalent amino acid interactions within a protein structure is essential for a complete understanding of its conformation, stability and function. Among all the amino acids that usually make up a protein, tryptophan is distinguished both by its rarity and by the size of its side chain, which is formed by an indole group. It is able to provide various types of indispensable interactions within the protein and between different polypeptide chains, but also between the protein and a biological membrane. In addition, it is the most commonly used natural fluorophore. Databases of solved protein structures are commonly used to study amino acid interactions and allow more or less complex analyses of the issue. Many non-covalent interactions that may occur between tryptophan and other amino acids have thus been found. However, most of these analyses focus on specific interactions and do not follow up on tryptophan's environment as a whole, where all amino acids interact. Some newly developed methods have been used in this thesis, specifically the occurrence profiles of the individual amino acids around the indole group of tryptophan, and the results were compared with the available literature. The amino acid that has the greatest preference for tryptophan turned out to be tryptophan again, and...
12

Generalized Simulated Annealing Parameter Sweeping Applied to the Protein Folding Problem / Mapeamento de Parâmetros do Simulated Annealing Generalizado aplicado ao problema do Enovelamento de Proteínas

Flavia Paiva Agostini 06 June 2009
As genome sequencing advances, understanding protein structures becomes a crucial extension of that progress. Despite the numerous recent technological advances, experimental determination of protein tertiary structures is still very slow compared with the rate at which amino acid sequence data accumulate. This makes protein folding a central problem for the development of the post-genomic era. In this work we use an optimization method, Generalized Simulated Annealing (GSA), which is based on Tsallis' generalized thermostatistics, to investigate the protein folding problem. Although GSA is a general procedure, its efficiency depends not only on an appropriate choice of parameters but also on the topological characteristics of the energy hypersurface of the cost function. Mapping the parameters required by GSA can significantly reduce the number of possible choices and also allows an analysis of the effect of the parameters on the behavior of the algorithm. As an initial step, we apply GSA to known structures, such as polyalanines, against which the GSA results can be compared. We then apply GSA to three peptides of ribosomal P proteins, which are of considerable importance for understanding Chagas' heart disease. Each contains 13 amino acids, and they differ only by a non-conservative mutation at the third residue. As these peptides have no experimentally resolved structure, we analyze the results obtained from GSA followed by Molecular Dynamics simulations. The validity of these results is studied so that, in the future, unknown structures can be determined by this technique with a higher degree of confidence.
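The abstract names GSA and its parameter dependence without showing the algorithm, so a minimal sketch may help. It assumes the commonly published Tsallis-Stariolo forms of the cooling schedule and acceptance probability; the thesis's own implementation is not given in this record. The names `gsa_temperature`, `gsa_accept_probability` and `gsa_minimize` are hypothetical, and the Gaussian proposal step is a simplified stand-in for the Tsallis visiting distribution.

```python
import math
import random


def gsa_temperature(step, t0, q_v):
    """Cooling schedule in the Tsallis-Stariolo form (assumes q_v > 1, step >= 1)."""
    return t0 * (2.0 ** (q_v - 1.0) - 1.0) / ((1.0 + step) ** (q_v - 1.0) - 1.0)


def gsa_accept_probability(delta_e, temperature, q_a):
    """Generalized Metropolis acceptance probability for an energy change delta_e."""
    if delta_e <= 0.0:
        return 1.0  # downhill moves are always accepted
    if q_a == 1.0:
        return math.exp(-delta_e / temperature)  # classical Boltzmann limit
    base = 1.0 + (q_a - 1.0) * delta_e / temperature
    if base <= 0.0:
        return 0.0  # can only happen for q_a < 1
    return base ** (-1.0 / (q_a - 1.0))


def gsa_minimize(energy, x0, steps=2000, t0=1.0, q_v=2.0, q_a=1.5, seed=0):
    """Minimize energy(x) over a list of floats; returns (best_x, best_energy)."""
    rng = random.Random(seed)
    x, e = list(x0), energy(x0)
    best_x, best_e = list(x), e
    for step in range(1, steps + 1):
        temp = gsa_temperature(step, t0, q_v)
        # Assumption: Gaussian step scaled by sqrt(temp), NOT the Tsallis visiting distribution.
        cand = [xi + rng.gauss(0.0, 1.0) * math.sqrt(temp) for xi in x]
        e_cand = energy(cand)
        if rng.random() < gsa_accept_probability(e_cand - e, temp, q_a):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = list(x), e
    return best_x, best_e
```

A parameter sweep in the spirit of the abstract would call `gsa_minimize` over a grid of `q_v`, `q_a` and `t0` values and compare the best energies found for each combination.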
13

Resolução de estruturas de proteínas utilizando-se dados de RMN a partir de um algorítmo genético de múltiplos mínimos / Resolution of Protein Structures Using NMR Data by Means of a Genetic Algorithm of Multiple Minima

Marx Gomes Van der Linden 15 April 2009
Proteins are biological macromolecules composed of amino acid polymers that play a wide range of biological roles and are involved in every vital process of living organisms. Together with X-ray diffraction of crystals, Nuclear Magnetic Resonance (NMR) spectroscopy is one of the two main experimental techniques capable of delivering atomic-level resolution of protein structures. Prediction of protein structures using experimental information from NMR is a global optimization problem with restraints. GAPF is a computer program that uses a Genetic Algorithm (GA) developed for ab initio prediction (the determination of the structure of a protein from its amino acid sequence only), using a multiple minima approach and a fitness function derived from a classical molecular force field. The work presented here describes GAPF-NMR, a version derived from GAPF that uses experimental restraints from NMR to support the search for the best protein structures corresponding to a given sequence. Five different versions of the algorithm were developed, each with a variation in how the energy function is calculated during the course of the program run. GAPF-NMR was applied to a test set of 7 proteins of known structure and, for every one of them, it reached a structure with the correct or an approximate fold. The versions of the algorithm that progressively increase the region of the protein used in the energy calculation performed better than the other versions, and the multiple minima approach was important for achieving good results.
Results were compared to those reported for GENFOLD, which is so far the only known alternative implementation of a GA for the problem of protein structure prediction using NMR data, and the current version of GAPF-NMR was shown to be superior to GENFOLD for two of the three proteins that compose GENFOLD's test set.
14

Modifications moléculaires et fonctionnelles au cours du vieillissement des poudres d’isolats de protéines solubles (WPI) / Molecular and functional changes occurring during ageing of whey protein isolate powders

Norwood, Eve-Anne 13 October 2016
During dairy protein powder manufacture, precautions are taken to ensure optimal technological functionality for the intended use. However, structural and functional changes appear during storage. This project aimed to understand the mechanisms of the changes that occur during storage of whey protein isolate (WPI) powders.
To do this, WPI powders were stored under controlled conditions (20°C, 40°C and 60°C for 15 months), and their structural and functional changes after rehydration were experimentally monitored at regular intervals. The results showed that ageing follows a specific path involving first the lactosylation of part of the proteins and then their aggregation in the dry state. In addition, this path was characterized by a catching-up behaviour, in which the changes observed at lower temperatures progress towards those observed at the highest temperatures. The impact of these structural changes on the functional properties was mixed: the foaming and interfacial properties were only slightly affected, while the heat-induced aggregation properties were greatly modified. To minimize these storage-induced changes, two routes for improvement were explored: the powder particle size, and the residual lactose content, whose role appeared to be crucial for the development of the Maillard reaction. The study showed that the difference in powder particle size had no influence on the ageing path, while lowering the lactose content significantly reduced the extent of the storage-induced changes.
15

Stereochemical Analysis On Protein Structures - Lessons For Design, Engineering And Prediction

Gunasekaran, K 12 1900
No description available.
16

3D struktury fosforylace / 3D structures of phosphorylation

Kielarová, Anežka January 2019
Protein phosphorylation is a common post-translational protein modification used in almost all cellular processes. When a phosphate group is added to an amino acid side chain, it may alter the protein conformation and protein-protein interactions due to its size and its negative charge. It may also change the protein function, activity and even localization within the cell. Experimental detection of phosphorylation is still extremely labor-intensive and very expensive, even when deploying protein mass spectrometry. For this reason, many bioinformatics groups focus on the prediction of protein phosphorylation sites. Recent analyses of phosphorylation sites studied mainly non-phosphorylated phosphorylation sites and the distribution and representation of the amino acids sequentially neighboring them. Since amino acids that are sequentially more distant but structurally close can contribute to the recognition of the protein substrate by a protein kinase, the structural environment of phosphorylation sites was studied in this thesis. Furthermore, 3D structures of phosphorylation sites were comprehensively studied for the first time in a phosphorylated state, and the results were compared with those obtained from the analysis of non-phosphorylated sites. Phosphorylation sites were found mostly within...
17

Mapeamento de Parâmetros do Simulated Annealing Generalizado aplicado ao problema do Enovelamento de Proteínas / Generalized Simulated Annealing Parameter Sweeping Applied to the Protein Folding Problem

Agostini, Flavia Paiva 06 June 2009
Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior.
As genome sequencing advances, understanding protein structures becomes a crucial extension of that progress. Despite the numerous recent technological advances, experimental determination of protein tertiary structures is still very slow compared with the rate at which amino acid sequence data accumulate. This makes protein folding a central problem for the development of the post-genomic era. In this work we use an optimization method, Generalized Simulated Annealing (GSA), which is based on Tsallis' generalized thermostatistics, to investigate the protein folding problem. Although GSA is a general procedure, its efficiency depends not only on an appropriate choice of parameters but also on the topological characteristics of the energy hypersurface of the cost function. Mapping the parameters required by GSA can significantly reduce the number of possible choices and also allows an analysis of the effect of the parameters on the behavior of the algorithm. As an initial step, we apply GSA to known structures, such as polyalanines, against which the GSA results can be compared. We then apply GSA to three peptides of ribosomal P proteins, which are of considerable importance for understanding Chagas' heart disease. Each contains 13 amino acids, and they differ only by a non-conservative mutation at the third residue. As these peptides have no experimentally resolved structure, we analyze the results obtained from GSA followed by Molecular Dynamics simulations. The validity of these results is studied so that, in the future, unknown structures can be determined by this technique with a higher degree of confidence.
18

Motif extraction from complex data : case of protein classification / Extraction de motifs des données complexes : cas de la classification des protéines

Saidi, Rabie 03 October 2012
The classification of biological data is one of the significant challenges in bioinformatics, for protein as well as for nucleic data. The presence of these data in huge masses, their ambiguity and especially the high cost of in vitro analysis in terms of time and resources make the use of data mining a necessity rather than a rational choice. However, data mining techniques, which often process data in relational format, are confronted with the inappropriate format of biological data. Hence, an inevitable pre-processing step must be established.
This thesis deals with protein data preprocessing as a preparation step before classification. We present motif extraction as a reliable way to address that task. The extracted motifs are used as descriptors to encode proteins into feature vectors. This enables the use of known data mining classifiers, which require this format. However, designing a suitable feature space for a set of proteins is not a trivial task.
We deal with two kinds of protein data, i.e., sequences and three-dimensional structures. In the first axis, i.e., protein sequences, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step. We demonstrate the efficiency of this approach by comparing it with several encoding methods, using some classifiers. We also propose new metrics to study the robustness of some of these methods when the input data are perturbed. These metrics measure the ability of a method to reveal any change occurring in the input data and also its ability to target the interesting motifs. The second axis is dedicated to 3D protein structures, which have recently been viewed as graphs of amino acids. We briefly survey the most used graph-based representations and propose a naïve method to help with protein graph construction. We show that some existing and widespread methods present remarkable weaknesses and do not really reflect the real protein conformation. In addition, we are interested in discovering recurrent sub-structures in proteins, which can give important functional and structural insights. We propose a novel algorithm to find spatial motifs in proteins. The extracted motifs match a well-defined shape that is proposed on a biological basis. We compare them with sequential motifs and with spatial motifs from recent related works. For all our contributions, the outcomes of the experiments confirm the efficiency of our proposed methods for representing both protein sequences and protein 3D structures in classification tasks. Software programs developed during this research work are available on my home page http://fc.isima.fr/~saidi.
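The encoding step described above (motifs used as descriptors that turn each protein into a feature vector) can be illustrated with a small sketch. It is deliberately simplified: motifs are taken to be exact k-mers that occur in at least `min_support` sequences, whereas the thesis defines motif similarity through amino-acid substitution matrices. The function names `frequent_kmers` and `encode_proteins` are hypothetical.

```python
from collections import Counter


def frequent_kmers(sequences, k, min_support):
    """Return the sorted k-mers that occur in at least min_support sequences."""
    support = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return sorted(kmer for kmer, count in support.items() if count >= min_support)


def encode_proteins(sequences, motifs):
    """Encode each sequence as a binary vector of motif presence (exact match)."""
    return [[1 if motif in seq else 0 for motif in motifs] for seq in sequences]
```

The resulting rows are in the relational, fixed-length format that standard classifiers expect, which is the point the abstract makes about preprocessing.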
19

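ESTUDOS ESTRUTURAIS DA UROCANATO HIDRATASE DE Trypanosoma cruzi POR MÉTODOS EXPERIMENTAIS E COMPUTACIONAIS / Structural studies of urocanate hydratase from Trypanosoma cruzi by experimental and computational methods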

Boreiko, Sheila 26 March 2014
Funding: Fundação Araucária de Apoio ao Desenvolvimento Científico e Tecnológico do Paraná.
Chagas' disease, caused by the protozoan Trypanosoma cruzi, is one of the seventeen neglected diseases according to the World Health Organization. In the last two decades, metabolic pathways specific to this parasite have been evaluated as therapeutic targets, opening prospects for the development of more specific and less toxic drugs. To achieve this goal, studies are needed to determine the three-dimensional structures of the proteins that take part in these pathways. Protein structures can be studied experimentally by X-ray diffraction and computationally by homology modeling; however, other structural information can also be obtained by spectroscopic techniques. In this work, structural studies of the enzyme urocanate hydratase from Trypanosoma cruzi (TcUH), which participates in the histidine metabolic pathway, were therefore carried out. The enzyme was expressed functionally in E. coli and, by affinity chromatography, effectively purified and then crystallized; however, the crystals did not reach the minimum quality for X-ray diffraction analysis. The structural study was thus carried out by circular dichroism (CD), small-angle X-ray scattering (SAXS) and homology modeling. TcUH is mainly composed of α-helices, and its thermal denaturation starts near 50 °C and is irreversible once complete. The SAXS study indicated that the protein is not monomeric in solution. With the homology model, which showed reasonable quality indices, docking studies indicated some promising molecules that should be studied carefully for possible inhibition tests.
20

Inferences on Structure and Function of Proteins from Sequence Data : Development of Methods and Applications

Mudgal, Richa January 2015
Structural and functional annotation of the sequences of putative proteins encoded in newly sequenced genomes poses an important challenge. While much progress has been made towards high-throughput experimental techniques for structure determination and functional assignment of proteins, most current genome-wide annotation systems rely on computational methods to derive cues on structure and function from relationships with related proteins of known structure and/or function. Evolutionary pressure on proteins forces the retention of sequence features that are important for structure and function. Thus, if it can be established that two proteins have descended from a common ancestor, it can be inferred that the structural fold and biological function of the two proteins would be similar. Homology-based information transfer from one protein to another has played a central role in the understanding of the evolution of protein structures, functions and interactions. Many algorithmic improvements have been developed over the past two decades to recognize homologues of a protein from sequence-based searches alone, but there are still a large number of proteins without any functional annotation. The sensitivity of the available methods can be further enhanced by indirect comparisons with the help of intermediately-related sequences which link related families. However, sequence-based homology searches in the current protein sequence space are often restricted to family members, due to the paucity of natural intermediate sequences that can act as linkers in detecting remote homologues. Thus a major goal of this thesis is to develop computational methods to fill the sparse regions of protein sequence space with computationally designed protein-like sequences and thereby create a continuum of protein sequences, which could aid in detecting remote homologues.
Such designed sequences are further assessed for their effectiveness in detection of distant evolutionary relationships and functional annotation of proteins with unknown structure and function. Another important aspect in structural bioinformatics is to gain a good understanding of protein sequence - structure - function paradigm. Functional annotations by comparisons of protein sequences can be further strengthened with the addition of structural information; however, instances of functional divergence and convergence may lead to functional mis-annotations. Therefore, a systematic analysis is performed on the fold–function associations using binding site information and their inter-relationships using binding site similarity networks. Chapter 1 provides a background on proteins, their evolution, classification and structural and functional features. This chapter also describes various methods for detection of remote similarities and the role of protein sequence design methods in detection of distant relatives for protein annotation. Pitfalls in prediction of protein function from sequence and structure are also discussed followed by an outline of the thesis. Chapter 2 addresses the problem of paucity of available protein sequences that can act as linkers between distantly related proteins/families and help in detection of distant evolutionary relationships. Previous efforts in protein sequence design for remote homology detection and design of sequences corresponding to specific protein families are discussed. This chapter describes a novel methodology to computationally design intermediately-related protein sequences between two related families and thus fill-in the gaps in the sequence space between the related families. Protein families as defined in SCOP database are represented as position specific scoring matrices (PSSMs) and these profiles of related protein families within a fold are aligned using AlignHUSH -a profile-profile alignment method. 
Guided by this alignment, the amino acid frequency distributions of the two families are combined, and for each aligned position a residue is selected based on its combined probability of occurring at that position in the two families. Each computationally designed sequence is then subjected to RPS-BLAST searches against a pool of profiles representing all protein families. Artificial sequences that detect both parent profiles, with no hits corresponding to other folds, qualify as ‘designed intermediate sequences’. Various scoring schemes and divergence levels for the design of protein-like sequences are investigated such that the designed sequences intersperse between two related families, thereby creating a continuum in sequence space. The method is then applied on a large scale to all folds with two or more families, resulting in the design of 3,611,010 intermediately-related sequences for 27,882 profile-profile alignments corresponding to 374 folds. Such designed sequences are generic in nature and can be added to any database of natural protein sequences. Such enriched databases can then be queried using any sequence-based remote homology detection method to detect distant relatives. The next chapter (Chapter 3) explores the ability of these designed intermediate sequences to act as linkers between two related families and to aid in the detection of remote homologues. To assess the applicability of these designed sequences, two types of databases have been generated: a CONTROL database containing protein sequences from natural sequence databases, and an AUGMENTED database in which designed sequences are added to the database of natural sequences. Detailed assessments of the utility of such designed sequences using traditional sequence-based searches in the AUGMENTED database showed enhanced detection of remote homologues for almost 74% of the folds.
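The Chapter 2 design step described above (combine the aligned residue distributions of two families, pick a residue per aligned position, then keep only sequences that hit both parents and no other fold) can be illustrated with a minimal Python sketch. This is not the thesis implementation: it assumes plain per-column frequency dictionaries rather than PSSMs, a hypothetical `weight_a` mixing parameter standing in for the scoring schemes and divergence levels, and search hits supplied as a ready-made dictionary instead of an actual RPS-BLAST run. All function and parameter names are illustrative.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def combine_columns(col_a, col_b, weight_a=0.5):
    """Mix two per-position residue frequency dicts and renormalise to sum to 1.

    weight_a is a hypothetical knob (not from the source) that biases the
    designed sequence towards family A (1.0) or family B (0.0).
    """
    mixed = {
        aa: weight_a * col_a.get(aa, 0.0) + (1.0 - weight_a) * col_b.get(aa, 0.0)
        for aa in AMINO_ACIDS
    }
    total = sum(mixed.values())
    if total == 0:
        raise ValueError("aligned column has no residue frequencies")
    return {aa: p / total for aa, p in mixed.items()}


def design_sequence(profile_a, profile_b, weight_a=0.5, rng=None):
    """Sample one residue per aligned column from the combined distribution.

    profile_a and profile_b are equal-length lists of frequency dicts, one per
    aligned position (assumed to come from a profile-profile alignment).
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same aligned positions")
    rng = rng or random.Random()
    residues = []
    for col_a, col_b in zip(profile_a, profile_b):
        mixed = combine_columns(col_a, col_b, weight_a)
        choice = rng.choices(list(mixed), weights=list(mixed.values()), k=1)[0]
        residues.append(choice)
    return "".join(residues)


def qualifies_as_intermediate(hits, parent_a, parent_b, parent_fold):
    """Apply the acceptance rule: both parents detected, no hit in another fold.

    hits maps each family profile the sequence detected to that family's fold
    (an assumed input format; the search itself is not performed here).
    """
    hits_both_parents = parent_a in hits and parent_b in hits
    only_parent_fold = all(fold == parent_fold for fold in hits.values())
    return hits_both_parents and only_parent_fold
```

Calling `design_sequence` repeatedly with different `weight_a` values would give sequences that sit closer to one family or the other, which is one simple way to picture the "continuum" between two related families.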
For over 3,000 queries, it is demonstrated that designed sequences are positioned as suitable linkers, which mediate connections between distantly related proteins. Using examples from known distant evolutionary relationships, we demonstrate that homology searches in augmented databases show an increase of up to 22% in the number of correct evolutionary relationships "discovered". Such connections are reported with high sensitivities and very low false positive rates. Interestingly, they fill in void and sparse regions in sequence space and relate distant proteins not only through multiple routes but also through [...] SCOP-NrichD database, SUPFAM+ database, SUPERFAMILY database, protein domain library queried by pDomTHREADER and HHsearch against HMM library of SCOP families. This approach detected evolutionary relationships for almost 20% of all the families with no known structure or function. A detailed report of the predictions for 614 domains of unknown function (DUFs), together with their fold and species distributions, is provided in this chapter. These predictions are then enriched with GO terms and enzyme information wherever available. A detailed discussion is provided for a few of the interesting assignments: DUF1636, DUF1572 and DUF2092, which are functionally annotated as thioredoxin-like 2Fe-2S ferredoxin, putative metalloenzyme and lipoprotein localization factors, respectively. These 614 novel structure-function relationships, of which 193 are supported by consensus between at least two of the five methods, can be accessed at http://proline.biochem.iisc.ernet.in/RHD_DUFS/. Protein functions can be appreciated better in the light of evolutionary information from their structures. Chapter 6 describes a database of evolutionary relationships identified between Pfam families. The grouping of Pfam families is important for a better understanding of evolutionary relationships and for obtaining clues to the functions of proteins in families of as yet unknown function.
Many structural genomics initiatives have made considerable efforts in solving structures and bridging the growing gap between protein sequences and their structures. The results of such experiments suggest that a structure newly solved by X-ray crystallography or NMR often has structural similarity to a protein of already known structure. These relationships often remain undetected due to the unavailability of structural information. Therefore, the SUPFAM+ database aims to detect such distant relationships between Pfam families by mapping Pfam families to SCOP domain families. The work presented in this chapter describes the generation of the SUPFAM+ database using the sensitive AlignHUSH method to uncover hidden relationships. First, Pfam families are queried against a profile database of SCOP families to derive Pfam-SCOP associations, and then Pfam families are queried against the Pfam database to derive Pfam-Pfam relationships. Pfam families that remain without a mapping to a SCOP family are mapped indirectly to a SCOP family by identifying relationships between such Pfam families and other Pfam families that are already mapped to a SCOP family. The criteria for these mappings are kept stringent to minimize the rate of false positives. Where a Pfam family maps to two or more SCOP superfamilies, a decision tree is used to assign the Pfam family to a single SCOP superfamily. Using these direct and indirect evolutionary relationships present in the SCOP database, associations between Pfam families are derived. Therefore, a relationship between two Pfam families that lack significant sequence similarity can be identified if both are related to the same SCOP superfamily. Almost 36% of the Pfam families could be mapped to SCOP families through direct or indirect association.
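The direct and indirect Pfam-to-SCOP mapping described above can be pictured with a short Python sketch. It is an illustration under stated assumptions rather than the SUPFAM+ pipeline: the profile searches are assumed to have been run already and are supplied as plain dictionaries and pairs, the identifiers are hypothetical, indirect mapping goes only one step through a directly mapped family, and the decision tree for resolving conflicting superfamilies is replaced by a simple "first assignment wins" placeholder.

```python
from collections import defaultdict
from itertools import combinations


def map_pfam_to_scop(direct, pfam_links):
    """Extend direct Pfam -> SCOP superfamily assignments with indirect ones.

    direct:     dict of Pfam family -> SCOP superfamily (assumed search output).
    pfam_links: iterable of (pfam, pfam) pairs judged related to each other.
    A family with no direct mapping inherits the superfamily of a directly
    mapped partner. Conflicts are NOT resolved by a decision tree here; the
    first inherited assignment is kept (placeholder behaviour).
    """
    mapping = dict(direct)
    for fam_x, fam_y in pfam_links:
        for unmapped, mapped in ((fam_x, fam_y), (fam_y, fam_x)):
            if unmapped not in direct and mapped in direct:
                mapping.setdefault(unmapped, direct[mapped])
    return mapping


def related_pfam_pairs(mapping):
    """Return Pfam family pairs that share a SCOP superfamily."""
    by_superfamily = defaultdict(list)
    for pfam, superfamily in mapping.items():
        by_superfamily[superfamily].append(pfam)
    pairs = set()
    for members in by_superfamily.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical identifiers, for illustration only.
direct_hits = {"PF_A": "sf1", "PF_B": "sf1", "PF_C": "sf2"}
pfam_pfam_hits = [("PF_D", "PF_A"), ("PF_E", "PF_F")]

full_mapping = map_pfam_to_scop(direct_hits, pfam_pfam_hits)
shared_superfamily_pairs = related_pfam_pairs(full_mapping)
```

In this toy input, `PF_D` has no direct SCOP hit but is linked to `PF_A`, so it inherits `sf1` and becomes related to `PF_B` as well, even though the two were never compared directly; `PF_E` and `PF_F` stay unmapped because neither has a mapped partner.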
These Pfam-SCOP associations are grouped into 1,646 different superfamilies [...] and cataloguing changes that occur in the binding sites between two functions, which are analysed in this study to trace possible routes between different functions in evolutionarily related enzymes. The main conclusions of the entire thesis, which contributes to remote homology detection from sequence information alone and to understanding the ‘sequence-structure-function’ paradigm from a binding site perspective, are summarized in Chapter 8. The chapter illustrates the importance of the work presented here in the post-genomic era. It highlights the development of the algorithm for the design of ‘intermediately-related sequences’ that can serve as effective linkers in remote homology detection, its subsequent large-scale assessment, and the fact that the designed sequences can be added to any protein sequence database and searched by any sequence-based method. Databases in the NrichD resource are made available in the public domain, along with a portal to design artificial sequences for or between protein families. This thesis also provides useful predictions for protein families of as yet unknown structure and function using the NrichD database as well as four other state-of-the-art sequence-based remote homology detection methods. A different aspect addressed in this thesis provides a fundamental understanding of the relationships between protein structure and function. Evolutionary relationships between functional families are identified using the inherent structural information for these families, and fold-function relationships are studied from the perspective of similarities in their binding sites. Such studies help in the areas of functional annotation, polypharmacology and protein engineering.
