Global ETD Search

1	Improving algorithms of gene prediction in prokaryotic genomes, metagenomes, and eukaryotic transcriptomes
Tang, Shiyuyun 27 May 2016

Next-generation sequencing has generated an enormous amount of DNA and RNA sequence data that potentially carries volumes of genetic information, e.g. protein-coding genes. The thesis is divided into three main parts describing i) GeneMarkS-2, ii) GeneMarkS-T, and iii) MetaGeneTack.

In prokaryotic genomes, ab initio gene finders can predict genes with high accuracy. However, the error rate is not negligible and is largely species-specific. Most errors in gene prediction are made in genes located in genomic regions with atypical GC composition, e.g. genes in pathogenicity islands. We describe a new algorithm, GeneMarkS-2, that uses local GC-specific heuristic models for scoring individual ORFs in the first step of analysis. Predicted atypical genes are retained and serve as ‘external’ evidence in subsequent runs of self-training. GeneMarkS-2 also controls the quality of the training process by selecting optimal orders of the Markov chain models as well as duration parameters in the hidden semi-Markov model. GeneMarkS-2 has shown significantly improved accuracy compared with other state-of-the-art gene prediction tools.

Massively parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) provides a large number of RNA reads that can be assembled into a full transcriptome. We have developed a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts. Unsupervised estimation of the algorithm's parameters makes several steps of conventional gene prediction protocols unnecessary, most importantly the manual curation of training sets. We have demonstrated that GeneMarkS-T self-training is robust to errors in assembled transcripts and that the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting gene starts compares favorably to other existing methods.

Frameshift (FS) prediction is important for the analysis and biological interpretation of metagenomic sequences. Reads in metagenomic samples are prone to sequencing errors. Insertion and deletion errors that change the coding frame impair the accurate identification of protein-coding genes. Accurate frameshift prediction requires a sufficient amount of data to estimate parameters of species-specific statistical models of protein-coding and non-coding regions. However, such data are not available; all we have is metagenomic sequences of unknown origin. The challenge of ab initio FS detection is, therefore, twofold: (i) to find a way to infer the necessary model parameters and (ii) to identify positions of frameshifts (if any). We describe a new tool, MetaGeneTack, which uses a heuristic method to estimate parameters of the sequence models used in the FS detection algorithm. It was shown on several test sets that the FS detection performance of MetaGeneTack is comparable to or better than that of the earlier developed program FragGeneScan.

Gene prediction; Genome annotation; Prokaryotic genomes; Ribosomal binding site; Hidden Markov models; Adaptive training; Unsupervised self-training; Heuristic models; RNA-Seq; RNA transcripts; Frameshift prediction; Metagenomics

2	Rational and combinatorial genetic engineering approaches for improved recombinant protein production and purification
Bandmann, Nina January 2007

The bacterium Escherichia coli (E. coli) is in many situations an ideal host for production of recombinant proteins, since it generally provides a rapid and economical means to achieve sufficiently high product quantities. However, there are several factors that may limit this host’s ability to produce large amounts of heterologous proteins in a soluble and native form. For many applications a high purity of the recombinant protein is demanded, which implies a purification strategy where the product can be efficiently isolated from the complex milieu of host cell contaminants. In this thesis, different strategies based on both rational and combinatorial genetic engineering principles have been investigated, aiming at improving and facilitating recombinant E. coli protein production and purification.

One objective was to improve the PEG/salt aqueous two-phase system (ATPS) purification process of the lipase cutinase, by increasing the selectivity of the protein for the system top-phase. Peptide tags, with varying properties, were designed and genetically fused to the C-terminal end of ZZ-cutinase. Greatly increased partitioning values were observed for purified protein variants fused to tryptophan-containing peptide tags, particularly a (WP)4 peptide. The partitioning properties of the ZZ-cutinase-(WP)4 protein were also retained when added to the ATPS directly from an E. coli total cell disintegrate, emphasizing the applicability of this genetic engineering strategy for primary protein purification in ATPSs.

Furthermore, a combinatorial library approach using phage display technology was investigated as a tool for identification of peptide tags capable of improving partitioning properties of ZZ-cutinase in an ATPS. Repeated ATPS-based partitioning-selection cycles of a large phagemid (pVIII) peptide library resulted in isolation of phage particles preferentially decorated with peptides rich in tyrosine and proline residues. Both a peptide corresponding to a phage-library-derived peptide sequence and peptides designed based on information about amino acid appearance frequencies in later selection rounds were shown to improve partitioning several-fold when genetically fused to the C-terminal end of ZZ-cutinase. From the two- to four-fold increased production yields observed for these fusion proteins compared to ZZ-cutinase-(WP)4, it was concluded that the selection system used allowed for selection of desired peptide properties related to both partitioning and E. coli protein production parameters.

Bacterial protein production is affected by several different mRNA and protein sequence-related features. Attempts to address single parameters in this respect are difficult due to the inter-dependence of many features, for example between codon optimization and mRNA secondary structure effects. Two combinatorial expression vector libraries (ExLib1 and ExLib2) were constructed using a randomization strategy that potentially could lead to variations in many of these sequence-related features and which would allow a pragmatic search for vector variants showing positive net effects on the level of soluble protein production. ExLib1 was constructed to encode all possible synonymous codons of an eight amino acid N-terminal extension of protein Z, fused to the N-terminus of an enhanced green fluorescent reporter protein (EGFP). In ExLib2, the same eight positions were randomized using an (NNG/T) degeneracy code, which could lead to various effects on both the nucleotide and protein level, through the introduction of nucleotide sequences functional as e.g. alternative ribosome binding or translation initiation sites or as translated codons for an N-terminal extension of the target protein by a peptide sequence. Flow cytometric analyses and sorting of library cell cultures resulted in isolation of clones displaying several-fold increases in whole cell fluorescence compared to a reference clone. SDS-PAGE and western blot analyses verified that this was a result of increases (up to 24-fold) in soluble intracellular ZEGFP product protein content. Both position-specific codon bias effects and the appearance of new ribosomal binding sites in the library sequences were concluded to have influenced the protein production.

To explore the possibility of applying the same combinatorial library strategy for improving soluble intracellular production of heterologous proteins proven difficult to express in E. coli, three proteins of either bacterial (a transcriptional regulator (DntR)) or human (progesterone receptor ligand binding domain (PR-LBD) and 11-β-hydroxysteroid dehydrogenase type I (11-β)) origin were cloned into the ExLib2 library. Flow cytometric sorting of libraries resulted in isolation of DntR library clones showing increased soluble protein production levels and PR-LBD library clones with up to ten-fold increases in whole cell fluorescence, although the product under these conditions co-separated with the insoluble cell material.

QC 20100623

Aqueous two-phase system; combinatorial library; expression vector; flow cytometry; fusion tag; partitioning; peptide library; phage display; recombinant proteins; ribosomal binding site; translation initiation; Bioengineering; Bioteknik (Biotechnology)
