High quality gene annotation for deep phylogenetic analysis

Gene prediction in newly sequenced genomes is a known challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple very similar paralogs are present.
The issue is aggravated further when genomes are assembled only to a preliminary draft level, as contigs or short scaffolds rather than chromosomes. Nevertheless, such genomes deliver valuable information for studying gene families. High-accuracy models of protein-coding genes are needed in particular for phylogenetics and for the analysis of gene family histories.
In this dissertation, I established a tool, the ExonMatchSolver-pipeline (EMS-pipeline), that can assist the assembly of genes distributed across multiple fragments (e.g. contigs). In particular, the tool tackles the problem of identifying those groups of coding exons that belong to the same paralogous gene in a fragmented genome assembly. The EMS-pipeline includes a homology search step that takes a set of several highly similar paralogous proteins as query. The core of the pipeline uses an integer linear programming (ILP) implementation to solve the paralog-to-contig assignment problem. An extension to the initial implementation estimates the number of paralogs encoded in the target genome and can handle several paralogs that are situated on the same genomic fragment.
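The paralog-to-contig assignment idea can be illustrated with a toy sketch. This is not the EMS-pipeline's actual formulation, which the abstract does not spell out. It assumes that each contig carries a set of exon hits with one alignment score per paralog, that a contig is assigned to at most one paralog, and that no exon of a paralog may be covered twice. An exhaustive search stands in for the ILP solver, and the function name, input layout and scores are all hypothetical.

```python
from itertools import product


def assign_contigs(scores, exons):
    """Toy paralog-to-contig assignment (illustrative assumption only).

    scores[contig][paralog] -> alignment score of the contig against a paralog
    exons[contig]           -> set of exon indices found on that contig
    Returns (best_total_score, {contig: paralog or None}).
    """
    contigs = sorted(scores)
    paralogs = sorted({p for row in scores.values() for p in row})
    best_total, best_assignment = float("-inf"), None

    # Enumerate every assignment; None means the contig stays unassigned.
    # An ILP solver would search the same space without full enumeration.
    for choice in product([None] + paralogs, repeat=len(contigs)):
        covered = {p: set() for p in paralogs}
        total = 0
        feasible = True
        for contig, paralog in zip(contigs, choice):
            if paralog is None:
                continue
            # Constraint: each exon of a paralog is covered at most once.
            if paralog not in scores[contig] or covered[paralog] & exons[contig]:
                feasible = False
                break
            covered[paralog] |= exons[contig]
            total += scores[contig][paralog]
        if feasible and total > best_total:
            best_total, best_assignment = total, dict(zip(contigs, choice))
    return best_total, best_assignment


# Hypothetical data: contig_a and contig_b both carry exons 1 and 2,
# so they cannot belong to the same paralog.
scores = {
    "contig_a": {"P1": 90, "P2": 70},
    "contig_b": {"P1": 85, "P2": 80},
    "contig_c": {"P1": 60, "P2": 88},
}
exons = {"contig_a": {1, 2}, "contig_b": {1, 2}, "contig_c": {3}}

print(assign_contigs(scores, exons))
```

In this toy input, picking the best-scoring paralog for each contig independently would place both contig_a and contig_b on P1, although they carry the same exons. The joint optimization instead moves contig_b to P2.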
The EMS-pipeline was successfully applied to simulated data, to several showcase examples and to deuterostome genomes in a large-scale study on the evolution of the arrestin protein family. Especially at high genome fragmentation levels, the tool outperformed a naive assignment method.
Arrestins are key signaling transducers that bind to activated and phosphorylated G protein-coupled receptors and can mediate their endocytosis into the cell. The refined arrestin annotations resulting from the application of the EMS-pipeline are more complete and accurate than those obtained with a conventional database search strategy. With the applied strategy it was possible to map the duplication and deletion history of arrestin paralogs in detail, including tandem duplications, pseudogenizations and the formation of retrogenes.
My results support the emergence of the four arrestin paralogs from a visual and a non-visual proto-arrestin. Surprisingly, the visual ARR3 was lost in the mammalian clades afrotherians and xenarthrans. Segmental duplications in specific clades and the 3R-WGD in the teleost stem lineage, on the other hand, must have given rise to new paralogs that show signatures of diversification in functional elements important for receptor binding and phosphate sensing. The four vertebrate orthology groups show an interesting pattern of divergence of three endocytosis motifs: the minor and major clathrin binding site and the adapter protein-2 (AP-2) binding motif.
Identifying such signatures, as well as residues that determine specificity between paralogs and are positively selected after duplication, was made possible by high-quality alignments obtained through genome inquiries, dense species sampling and the consideration of fragmented loci from poorly assembled genomes within the framework of the EMS-pipeline established in this dissertation.

1 Introduction
1.1 Basics and definitions
1.1.1 What is a gene?
1.1.2 What is a tree in phylogenetics?
1.1.3 What are paralogs and orthologs?
1.1.4 Central dogma in molecular biology: From DNA to protein
1.2 Gene duplications as evolutionary playground
1.2.1 Mechanisms of gene duplication
1.2.2 Evolutionary fate of duplicated genes
1.3 Identification and annotation of protein homologs
1.3.1 Challenges of existing resources
1.3.2 Similarity search approaches without consideration of the gene structure
1.3.3 Gene structure aware gene annotation approaches
1.3.4 Graph-based inference of orthology relationships
1.3.5 Chance and challenge of fragmented assemblies
1.4 Applied phylogenetic methods
1.4.1 Phylogenetic inference in a nutshell
1.4.2 Inference of natural selection in inter-species data sets
1.4.3 Detection of specificity determining positions
1.5 Multi-talents in cell signaling: The cytosolic arrestin proteins
1.5.1 Functions of arrestins in cell signaling
1.5.2 Arrestin activation by GPCR binding
1.5.3 Functions of arrestins in cellular trafficking
1.5.4 Evolution of arrestins
2 The ExonMatchSolver-pipeline
2.1 Motivation
2.2 Methods
2.2.1 Pipeline overview
2.2.2 Exon assembly as an assignment problem
2.2.3 Solving the Paralog-to-Contig Assignment Problem
2.2.4 Post-processing
2.2.5 Implementation and usage
2.2.6 Performance assessment by simulations
2.3 Results
2.3.1 Performance on simulated data
2.3.2 Performance on real data - Two Showcase Examples
2.4 Discussion
3 Evolution of the arrestin protein family in deuterostomes
3.1 Motivation
3.2 Material and Methods
3.2.1 Database scan
3.2.2 Detailed gene annotation
3.2.3 Data resources used in the current study
3.2.4 Alignment and building of phylogenetic trees
3.2.5 Identification of specificity determining positions
3.2.6 Testing for natural selection
3.2.7 Assessment of conservation
3.2.8 Parsimonious reconstruction of exon gain and loss events
3.3 Results
3.3.1 Evolution of the arrestin fold family based on database inquiries
3.3.2 The refined arrestin annotations are more complete than database entries
3.3.3 Arrestin paralog gain and loss patterns based on the refined annotations
3.3.4 Evolution of arrestin functional elements
3.4 Discussion
3.4.1 Limitation of arrestin database annotations
3.4.2 Arrestins in early vertebrate evolution
3.4.3 Sub- and neofunctionalization as consequence of the 3R-WGD
3.4.4 Independent arrestin duplications in deuterostomes
3.4.5 Loss of arrestin paralogs in different vertebrate orders
3.4.6 Previously unknown interaction partners and isoforms
4 Improvements on the ExonMatchSolver-pipeline
4.1 Motivation
4.2 Methods
4.2.1 Estimation of the paralog number
4.2.2 Subdivision of gene loci on the same contig
4.2.3 Implementation details
4.2.4 Assessment of the ExonMatchSolver-pipeline Version 2
4.3 Results
4.4 Discussion
5 Conclusion and Outlook
A Additional figures
B Additional tables
C CV
Bibliography

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:31336
Date: 27 August 2018
Creators: Indrischek, Henrike
Contributors: Universität Leipzig
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: German
Detected Language: English
Type: info:eu-repo/semantics/acceptedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text
Rights: info:eu-repo/semantics/openAccess
Relation: 10.1186/s12862-017-1001-4, 10.1186/s13015-016-0063-y, 10.1186/s12862-018-1147-8
