  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Structural studies on the anti-HIV1-gp 41 antibody, 1583

Beauchamp, Jeremy January 1994
No description available.
2

Studies on the regulatory and catalytic properties of E. coli citrate synthase

Handford, P. A. January 1988
No description available.
3

Structural studies on complement factor H and its homologues

Day, A. J. January 1988
No description available.
4

Analysis of Madm, a novel adaptor protein that associates with Myeloid Leukemia Factor 1

Lim, Raelene January 2003
Myeloid Leukemia Factor 1 (Mlf1) is the murine homolog of MLF1, which was identified as a fusion gene with Nucleophosmin (NPM) resulting from the (3;5)(q25.1;q34) translocation associated with acute myeloid leukemia and myelodysplastic syndrome (Yoneda-Kato et al., 1996). Mlf1 was independently isolated using cDNA representational difference analysis to identify genes up-regulated when an erythroleukemic cell line underwent a lineage switch to display a monoblastoid phenotype (Williams et al., 1999). Mlf1 has been shown to enhance myeloid differentiation and suppress erythroid differentiation; however, its mechanism of action is unknown. A yeast two-hybrid screen was employed to identify Mlf1-interacting proteins. This screen isolated a number of known proteins, as well as several novel molecules, that bound Mlf1. One of these was 14-3-3ζ, a member of a family of molecules that bind phosphoserine motifs and regulate the subcellular localization of partner proteins. Mlf1 contains a classic RSXSXP sequence for 14-3-3 binding and associated with 14-3-3ζ via this phosphorylated motif (Lim et al., 2002). The aim of this thesis was to characterise a novel Mlf1-interacting protein that had some homology to protein kinases and was named Mlf1 Adaptor Molecule (Madm). Adaptor proteins are molecules that possess no enzymatic or transcriptional activity, but instead mediate protein-protein interactions. Madm is encoded by a gene consisting of 18 exons, and promoter analysis suggested Madm expression might be widespread; indeed, Northern blotting of adult tissues and in situ hybridization of embryos demonstrated ubiquitous Madm expression. Significantly, the Madm protein sequence is highly conserved across diverse species. / Madm formed dimers and, although it contains a kinase-like domain, the protein lacks several critical residues required for catalytic activity, including an ATP-binding site.
Purification of recombinant Madm revealed that the protein was not a kinase; however, studies in mammalian cells showed that Madm associated with a kinase and that Madm was phosphorylated on serine residues in vivo and in vitro. Madm also contains a nuclear localization sequence and a nuclear export sequence and was shown to localise to both cytoplasm and nucleus by subcellular fractionation and confocal microscopy. The presence of two nuclear receptor binding motifs (consensus LXXLL) suggests that Madm may have a functional role in the nucleus. Madm co-immunoprecipitated with Mlf1 and co-localized with it in the cytoplasm. In addition, the Madm-associated kinase phosphorylated Mlf1 on serine residues, including the RSXSXP motif. In contrast to wild-type Mlf1, the oncogenic fusion protein NPM-MLF1 did not bind 14-3-3ζ and localized exclusively in the nucleus. Although Madm co-immunoprecipitated with NPM-MLF1, the binding mechanism was altered. As Mlf1 is able to reprogram erythroleukemic cells to display a monoblastoid phenotype and potentiate myeloid maturation (Williams et al., 1999), the effects of Madm on myeloid differentiation were investigated. However, unlike Mlf1, ectopic expression of Madm in M1 myeloid cells suppressed cytokine-induced differentiation. / In summary, the data presented in this thesis report on the cloning and characterization of a novel adaptor protein that is involved in the phosphorylation of the proto-oncoprotein Mlf1. Phosphorylation of Mlf1 is likely to affect its interaction with other proteins, such as 14-3-3ζ. Complex formation, therefore, may well alter the localization of Mlf1 and Madm, and influence hematopoietic differentiation.
5

Prediction of function shift in protein families

Abhiman, Saraswathi, January 2006
Diss. (summary) Stockholm: Karolinska Institutet, 2006. / Includes 4 papers.
6

Inferences on Structure and Function of Proteins from Sequence Data : Development of Methods and Applications

Mudgal, Richa January 2015
Structural and functional annotation of sequences of putative proteins encoded in the newly sequenced genomes poses an important challenge. While much progress has been made towards high throughput experimental techniques for structure determination and functional assignment to proteins, most of the current genome-wide annotation systems rely on computational methods to derive cues on structure and function based on relationship with related proteins of known structure and/or function. Evolutionary pressure on proteins forces the retention of sequence features that are important for structure and function. Thus, if it can be established that two proteins have descended from a common ancestor, then it can be inferred that the structural fold and biological function of the two proteins would be similar. Homology based information transfer from one protein to another has played a central role in the understanding of evolution of protein structures, functions and interactions. Many algorithmic improvements have been developed over the past two decades to recognize homologues of a protein from sequence-based searches alone, but there are still a large number of proteins without any functional annotation. The sensitivity of the available methods can be further enhanced by indirect comparisons with the help of intermediately-related sequences which link related families. However, sequence-based homology searches in the current protein sequence space are often restricted to the family members, due to the paucity of natural intermediate sequences that can act as linkers in detecting remote homologues. Thus, a major goal of this thesis is to develop computational methods to fill up the sparse regions in the protein sequence space with computationally designed protein-like sequences and thereby create a continuum of protein sequences, which could aid in detecting remote homologues.
Such designed sequences are further assessed for their effectiveness in detection of distant evolutionary relationships and functional annotation of proteins with unknown structure and function. Another important aspect in structural bioinformatics is to gain a good understanding of the protein sequence-structure-function paradigm. Functional annotations by comparisons of protein sequences can be further strengthened with the addition of structural information; however, instances of functional divergence and convergence may lead to functional mis-annotations. Therefore, a systematic analysis is performed on the fold–function associations using binding site information and their inter-relationships using binding site similarity networks. Chapter 1 provides a background on proteins, their evolution, classification and structural and functional features. This chapter also describes various methods for detection of remote similarities and the role of protein sequence design methods in detection of distant relatives for protein annotation. Pitfalls in prediction of protein function from sequence and structure are also discussed, followed by an outline of the thesis. Chapter 2 addresses the problem of paucity of available protein sequences that can act as linkers between distantly related proteins/families and help in detection of distant evolutionary relationships. Previous efforts in protein sequence design for remote homology detection and design of sequences corresponding to specific protein families are discussed. This chapter describes a novel methodology to computationally design intermediately-related protein sequences between two related families and thus fill in the gaps in the sequence space between the related families. Protein families as defined in the SCOP database are represented as position specific scoring matrices (PSSMs), and these profiles of related protein families within a fold are aligned using AlignHUSH, a profile-profile alignment method.
Guided by this alignment, the amino acid frequency distributions of the two families are combined, and for each aligned position a residue is selected based on its combined probability of occurring at the aligned positions of the two families. Each computationally designed sequence is then subjected to RPS-BLAST searches against an all-profile pool representing all protein families. Artificial sequences that detect both the parent profiles with no hits corresponding to other folds qualify as ‘designed intermediate sequences’. Various scoring schemes and divergence levels for the design of protein-like sequences are investigated such that these designed sequences intersperse between two related families, thereby creating a continuum in sequence space. The method is then applied on a large scale for all folds with two or more families and resulted in the design of 3,611,010 intermediately-related sequences for 27,882 profile-profile alignments corresponding to 374 folds. Such designed sequences are generic in nature and can be used to augment any sequence database of natural protein sequences. Such enriched databases can then be queried using any sequence-based remote homology detection method to detect distant relatives. The next chapter (Chapter 3) explores the ability of these designed intermediate sequences to act as linkers of two related families and aid in detection of remote homologues. To assess the applicability of these designed sequences, two types of databases have been generated, namely a CONTROL database containing protein sequences from natural sequence databases and an AUGMENTED database in which designed sequences are included in the database of natural sequences. Detailed assessments of the utility of such designed sequences using traditional sequence-based searches in the AUGMENTED database showed an enhanced detection of remote homologues for almost 74% of the folds.
For over 3,000 queries, it is demonstrated that designed sequences are positioned as suitable linkers, which mediate connections between distantly related proteins. Using examples from known distant evolutionary relationships, we demonstrate that homology searches in augmented databases show an increase of up to 22% in the number of correct evolutionary relationships "discovered". Such connections are reported with high sensitivities and very low false positive rates. Interestingly, they fill in void and sparse regions in sequence space and relate distant proteins not only through multiple routes but also through SCOP-NrichD database, SUPFAM+ database, SUPERFAMILY database, protein domain library queried by pDomTHREADER and HHsearch against HMM library of SCOP families. This approach detected evolutionary relationships for almost 20% of all the families with no known structure or function. A detailed report of predictions for 614 DUFs, with their fold and species distribution, is provided in this chapter. These predictions are then enriched with GO terms and enzyme information wherever available. A detailed discussion is provided for a few of the interesting assignments: DUF1636, DUF1572 and DUF2092, which are functionally annotated as thioredoxin-like 2Fe-2S ferredoxin, putative metalloenzyme and lipoprotein localization factors respectively. These 614 novel structure-function relationships, of which 193 are supported by consensus between at least two of the five methods, can be accessed from http://proline.biochem.iisc.ernet.in/RHD_DUFS/. Protein functions can be appreciated better in the light of evolutionary information from their structures. Chapter 6 describes a database of evolutionary relationships identified between Pfam families. The grouping of Pfam families is important for obtaining a better understanding of evolutionary relationships and clues to the functions of proteins in families of yet unknown function.
Many structural genomics initiative projects have made considerable efforts in solving structures and bridging the growing gap between protein sequences and their structures. The results of such experiments suggest that often the newly solved structure using X-ray crystallography or NMR methods has structural similarity to a protein with already known structure. These relationships often remain undetected due to unavailability of structural information. Therefore, the SUPFAM+ database aims to detect such distant relationships between Pfam families by mapping the Pfam families and SCOP domain families. The work presented in this chapter describes the generation of the SUPFAM+ database using the sensitive AlignHUSH method to uncover hidden relationships. Firstly, Pfam families are queried against a profile database of SCOP families to derive Pfam-SCOP associations, and then Pfam families are queried against the Pfam database to derive Pfam-Pfam relationships. Pfam families that remain without a mapping to a SCOP family are mapped indirectly to a SCOP family by identifying relationships between such Pfam families and other Pfam families that are already mapped to a SCOP family. The criteria are kept stringent for these mappings to minimize the rate of false positives. In case of a Pfam family mapping to two or more SCOP superfamilies, a decision tree is implemented to assign the Pfam family to a single SCOP superfamily. Using these direct and indirect evolutionary relationships present in the SCOP database, associations between Pfam families are derived. Therefore, a relationship between two Pfam families that do not have significant sequence similarity can be identified if both are related to the same SCOP superfamily. Almost 36% of the Pfam families could be mapped to SCOP families through direct or indirect association.
These Pfam-SCOP associations are grouped into 1,646 different superfamilies and cataloguing changes that occur in the binding sites between two functions, which are analysed in this study to trace possible routes between different functions in evolutionarily related enzymes. The main conclusions of the entire thesis are summarized in Chapter 8, contributing in the area of remote homology detection from sequence information alone and understanding the ‘sequence-structure-function’ paradigm from a binding site perspective. The chapter illustrates the importance of the work presented here in the post-genomic era. The development of the algorithm for the design of ‘intermediately-related sequences’ that could serve as effective linkers in remote homology detection, its subsequent large scale assessment and amenability to be augmented into any protein sequence database and exploration by any sequence-based search method is highlighted. Databases in the NrichD resource are made available in the public domain along with a portal to design artificial sequence for or between protein families. This thesis also provides useful and meaningful predictions for protein families with yet unknown structure and function using NrichD database as well as four other state-of-the-art sequence-based remote homology detection methods. A different aspect addressed in this thesis provides a fundamental understanding of the relationships between protein structure and functions. Evolutionary relationships between functional families are identified using the inherent structural information for these families and fold-function relationships are studied from a perspective of similarities in their binding sites. Such studies help in the area of functional annotation, polypharmacology and protein engineering. 
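The residue-selection step described in the abstract above (combining the amino acid frequencies of two aligned family profiles and picking a residue per aligned position) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the equal weighting of the two families, and the toy profiles are all assumptions made for illustration, and the later filtering step (RPS-BLAST against a profile pool) is not modelled.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def combine_columns(col_a, col_b, weight_a=0.5):
    """Mix two per-position amino acid frequency dicts into one distribution.

    weight_a is a hypothetical mixing weight; 0.5 treats both families equally.
    """
    mixed = {
        aa: weight_a * col_a.get(aa, 0.0) + (1.0 - weight_a) * col_b.get(aa, 0.0)
        for aa in AMINO_ACIDS
    }
    total = sum(mixed.values())
    if total == 0:
        raise ValueError("aligned column has no probability mass")
    return {aa: p / total for aa, p in mixed.items()}


def design_sequence(profile_a, profile_b, aligned_pairs, rng, weight_a=0.5):
    """Sample one residue per aligned position from the combined distribution.

    aligned_pairs lists (index in profile_a, index in profile_b) for each
    aligned column, as a profile-profile alignment would provide.
    """
    residues = []
    for i, j in aligned_pairs:
        dist = combine_columns(profile_a[i], profile_b[j], weight_a)
        letters = list(dist)
        weights = [dist[aa] for aa in letters]
        residues.append(rng.choices(letters, weights=weights, k=1)[0])
    return "".join(residues)


# Toy two-column profiles (hypothetical data, not from SCOP).
profile_a = [{"A": 1.0}, {"L": 0.5, "I": 0.5}]
profile_b = [{"A": 1.0}, {"V": 1.0}]
designed = design_sequence(profile_a, profile_b, [(0, 0), (1, 1)], random.Random(0))
```

In this toy case the first position is always `A`, because both families agree on it, while the second is drawn from `L`, `I` or `V` in proportion to the mixed frequencies.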
7

STORI: selectable taxon ortholog retrieval iteratively

Stern, Joshua Gallant 08 June 2015
Speciation and gene duplication are fundamental evolutionary processes that enable biological innovation. For over a decade, biologists have endeavored to distinguish orthology (homology caused by speciation) from paralogy (homology caused by duplication). Disentangling orthology and paralogy is useful to diverse fields such as phylogenetics, protein engineering, and genome content comparison. A common step in ortholog detection is the computation of Bidirectional Best Hits (BBH). However, we found this computation impractical for more than 24 Eukaryotic proteomes. Attempting to retrieve orthologs in less time than previous methods require, we developed a novel algorithm and implemented it as a suite of Perl scripts. This software, Selectable Taxon Ortholog Retrieval Iteratively (STORI), retrieves orthologous protein sequences for a set of user-defined proteomes and query sequences. While the time complexity of the BBH method is O(#taxa^2), we found that the average CPU time used by STORI may increase linearly with the number of taxa. To demonstrate one aspect of STORI’s usefulness, we used this software to infer the orthologous sequences of 26 ribosomal proteins (rProteins) from the large ribosomal subunit (LSU), for a set of 115 Bacterial and 94 Archaeal proteomes. Next, we used established tree-search methods to seek the most probable evolutionary explanation of these data. The current implementation of STORI runs on Red Hat Enterprise Linux 6.0 with installations of Moab 5.3.7, Perl 5 and several Perl modules. STORI is available at: <http://github.com/jgstern/STORI>.
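The quadratic cost attributed to Bidirectional Best Hits above comes from comparing every pair of taxa. The following Python sketch shows the idea on precomputed similarity scores. It is not STORI (which the abstract describes as a suite of Perl scripts) and does not reproduce its algorithm; the data layout, function names, and scores are hypothetical.

```python
from itertools import combinations


def best_hits(scores, query_taxon, target_taxon):
    """Map each protein of query_taxon to its top-scoring protein in target_taxon."""
    best = {}
    for (query, target), score in scores[(query_taxon, target_taxon)].items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(scores, taxa):
    """Return BBH pairs over every unordered pair of taxa.

    The outer loop visits len(taxa) * (len(taxa) - 1) / 2 taxon pairs,
    which is the source of the O(#taxa^2) behaviour.
    """
    pairs = []
    for taxon_a, taxon_b in combinations(taxa, 2):
        forward = best_hits(scores, taxon_a, taxon_b)
        backward = best_hits(scores, taxon_b, taxon_a)
        for query, target in forward.items():
            if backward.get(target) == query:
                pairs.append((query, target))
    return pairs


# Hypothetical similarity scores between two toy proteomes.
scores = {
    ("eco", "bsu"): {
        ("eco_L2", "bsu_L2"): 300,
        ("eco_L2", "bsu_L3"): 40,
        ("eco_L3", "bsu_L3"): 250,
    },
    ("bsu", "eco"): {
        ("bsu_L2", "eco_L2"): 295,
        ("bsu_L3", "eco_L3"): 240,
        ("bsu_L3", "eco_L2"): 35,
    },
}
bbh_pairs = bidirectional_best_hits(scores, ["eco", "bsu"])
```

A pair is kept only when each protein is the other's best hit, so the weak cross-hits between L2 and L3 are discarded.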
8

Towards a complete sequence homology concept: Limitations and applications

Wong, Wing-Cheong 14 December 2011
Historically, the paradigm of similarity of protein sequences implying common structure, function and ancestry was generalized based on studies of globular domains. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended for them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs) where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Since the matching of SPs/TMs creates the illusion of matching hydrophobic cores, the inclusion of SPs/TMs into domain models can give rise to wrong annotations. More than 1001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited to solely the SP/TM part among clearly unrelated proteins. More worryingly, this work shows explicit examples that the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74 using conservative criteria, this study finds that at least between 2.1% and 13.6% of its annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors where the hit has nothing in common with the problematic domain model except the SP/TM region itself. A workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users is provided.
While E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences, it can also complicate the annotation problem. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as E-value for the domain hit in the query sequence. We demonstrated that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g., 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1748 domain models in the hmmsearch global-search mode. For 145 out of the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value<0.1 when the EVD provides an E-value>0.1. Examples of false annotations are provided and the appropriateness of a logistic function as alternative to the EVD is critically discussed. This work shows that misguided E-value computation coupled with non-globular regions embedded in domain model library not only causes annotation errors in public databases but also limits the extrapolation power of protein function prediction tasks.
So far, the preceding work has demonstrated that sequence homology considerations widely used to transfer functional annotation to uncharacterized protein sequences require special precautions in the case of non-globular sequence segments including membrane-spanning stretches from non-polar residues. We found that there are two types of transmembrane helices (TMs) in membrane-associated proteins. On the one hand, there are so-called simple TMs with elevated hydrophobicity, low sequence complexity and extraordinary enrichment in long aliphatic residues. They merely serve as a membrane-anchoring device. In contrast, so-called complex TMs have lower hydrophobicity, higher sequence complexity and some functional residues. These TMs have additional roles besides membrane anchoring such as intramembrane complex formation, ligand binding or a catalytic role. Simple and complex TMs can occur both in single- and multi-membrane-spanning proteins essentially in any type of topology. Whereas simple TMs have the potential to confuse searches for sequence homologues and to generate unrelated hits with seemingly convincing statistical significance, complex TMs contain essential evolutionary information. For extending the homology concept onto membrane proteins, we provide a necessary quantitative criterion to distinguish simple TMs in query sequences prior to their usage in homology searches based on assessment of hydrophobicity and sequence complexity of the TM sequence segments. Theoretical insights from this work were applied to problems of function prediction for specific uncharacterized gene/protein sequences (for example, APMAP and ARXES) and for the functional classification of TM-containing proteins.
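The E-value switch described in this abstract (two estimates computed from the same score, with the lower one reported) can be sketched in Python as follows. The abstract does not give the formulas, so the functional forms below (a Gumbel tail for the EVD and a base-2 logistic bound on the bit score), the parameter values `mu` and `lam`, and the scaling by `database_size` are illustrative assumptions rather than a reproduction of HMMER2's code.

```python
import math


def evd_pvalue(score, mu, lam):
    """Gumbel (extreme value) tail probability P(S >= score).

    Uses expm1 for accuracy when the tail probability is very small.
    """
    return -math.expm1(-math.exp(-lam * (score - mu)))


def logistic_pvalue(score):
    """Assumed logistic bound on the P-value of a bit score: 1 / (1 + 2**score)."""
    return 1.0 / (1.0 + 2.0 ** score)


def reported_evalue(score, mu, lam, database_size):
    """Return (E-value, source), taking the lower of the two estimates."""
    p_evd = evd_pvalue(score, mu, lam)
    p_logistic = logistic_pvalue(score)
    if p_logistic < p_evd:
        return database_size * p_logistic, "logistic"
    return database_size * p_evd, "EVD"


# Hypothetical EVD parameters for a single domain model.
MU, LAM, DATABASE_SIZE = -5.0, 0.3, 10340
low_score = reported_evalue(0.0, MU, LAM, DATABASE_SIZE)
high_score = reported_evalue(20.0, MU, LAM, DATABASE_SIZE)
```

With these toy parameters the EVD estimate is the lower one at a score of 0 and the logistic estimate takes over at a score of 20, which is the kind of switch with growing scores that the abstract reports.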
9

Efficient Implementation & Application of Maximal String Covering Algorithm / MAXIMAL COVER ALGORITHM IMPLEMENTATION

Koponen, Holly January 2022
This thesis describes the development and application of the new software MAXCOVER that computes maximal covers and non-extendible repeats (a.k.a. “maximal repeats”). A string is a finite array x[1..n] of elements chosen from a set of totally ordered symbols called an alphabet. A repeat is a substring that occurs at least twice in x. A repeat is left/right extendible if every occurrence is preceded/followed by the same symbol; otherwise, it is non-left/non-right extendible (NLE/NRE). A non-extendible (NE) repeat is both NLE and NRE. A repeat covers a position i if x[i] lies within the repeat. A maximal cover (a.k.a. “optimal cover”) is a repeat that covers the most positions in x. For simplicity, we first describe a quadratic O(n^2) implementation of MAXCOVER to compute all maximal covers of a given string based on the pseudocode given in [1]. Then, we consider the log-linear O(n log n) pseudocode in [1], in which we identify several errors. We leave a complete correction and implementation for future work. Instead, we propose two improved quadratic algorithms that, as shown through experiments, execute in linear time in the average case. We perform a benchmark evaluation of MAXCOVER’s performance and demonstrate its value to biologists in the protein context [2]. To do so, we develop an extension of MAXCOVER for the closely related task of computing NE repeats. Then, we compare MAXCOVER to the repeat-match feature of the well-known MUMmer software [3] (600+ citations). We determine that MAXCOVER is an order of magnitude faster than MUMmer with much lower space requirements. We also show that MAXCOVER produces a more compact, exact, and user-friendly output that specifies the repeats. Availability: Open source code, binaries, and test data are available on GitHub at https://github.com/hollykoponen/MAXCOVER. Currently runs on Linux, untested on other OS.
/ Thesis / Master of Science (MSc) / This thesis deals with a simple yet essential data structure called a string, a sequence of symbols drawn from an alphabet. For example, a DNA sequence is a string comprised of four letters. We describe a new software called MAXCOVER that identifies maximal covers of a given string x (a repeating substring that ‘covers’ the most positions in x). This software is based on the algorithms in [1]. We propose two new algorithms that perform faster in practice. We also extended MAXCOVER for the closely related task of computing non-extendible repeats. We compare this extension to the well-known MUMmer software (600+ citations). We find that MAXCOVER is many times faster than MUMmer with much lower space requirements and produces a more compact, exact and user-friendly output.
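The definitions in this abstract (repeat, covered position, maximal cover) can be made concrete with a brute-force Python sketch. It is not MAXCOVER and does not follow the algorithms in [1]: it enumerates every substring, so it is far slower than the quadratic implementation the thesis describes and is meant only to illustrate the definitions. Indices are 0-based here, whereas the abstract writes x[1..n]; the function names are hypothetical.

```python
def occurrences(x, u):
    """Start indices (0-based) of all occurrences of u in x, overlaps included."""
    starts = []
    i = x.find(u)
    while i != -1:
        starts.append(i)
        i = x.find(u, i + 1)
    return starts


def positions_covered(x, u):
    """Number of positions of x lying inside at least one occurrence of u."""
    covered = set()
    for start in occurrences(x, u):
        covered.update(range(start, start + len(u)))
    return len(covered)


def maximal_covers(x):
    """Brute force: all repeats of x that cover the largest number of positions."""
    best, best_count = set(), 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n + 1):
            u = x[i:j]
            if len(occurrences(x, u)) < 2:
                continue  # not a repeat: it occurs only once
            count = positions_covered(x, u)
            if count > best_count:
                best, best_count = {u}, count
            elif count == best_count:
                best.add(u)
    return sorted(best), best_count


covers, covered_count = maximal_covers("abababa")
```

For the string `abababa`, the repeats `aba` and `ababa` each cover all seven positions, while shorter repeats such as `ab` cover only six, so both are maximal covers under the definition quoted above.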
10

Engineering Proteins from Sequence Statistics: Identifying and Understanding the Roles of Conservation and Correlation in Triosephosphate Isomerase

Sullivan, Brandon Joseph January 2011
No description available.
