New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly

Great efforts have been devoted to deciphering the sequence composition of
the genomes and transcriptomes of diverse organisms. Continuing advances in
high-throughput sequencing technologies have led to a decline in associated
costs, facilitating a rapid increase in the amount of available genetic data. In
particular, genome studies have undergone a fundamental paradigm shift where
genome projects are no longer limited by sequencing costs, but rather by
computational problems associated with assembly. There is an urgent demand
for more efficient and more accurate methods. Most recently, “hybrid”
methods that integrate short- and long-read data have been devised to address
this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a
bipartite overlap graph between long reads and restrictively filtered short-read
unitigs. This graph is translated into a long-read overlap graph. By design,
unitigs are both unique and almost free of assembly errors. As a consequence,
only a few spurious overlaps are introduced into the graph. Instead of the more
conventional approach of removing tips, bubbles, and other local features,
LazyB extracts subgraphs whose global properties approach a disjoint union of
paths in multiple steps, utilizing properties of proper interval graphs. A
prototype implementation of LazyB, entirely written in Python, not only yields
significantly more accurate assemblies of the yeast, fruit fly, and human
genomes compared to state-of-the-art pipelines, but also requires much less
computational effort. An optimized C++ implementation dubbed MuCHSALSA
further significantly reduces resource demands.
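
As a rough illustration of the translation step described above, the following minimal Python sketch projects a bipartite read-to-unitig graph onto a long-read overlap graph. It is not LazyB's implementation: the function name, the `min_shared` threshold, and the toy read and unitig identifiers are assumptions made for this example, and positions, orientations, and overlap consistency checks are ignored.

```python
from collections import defaultdict
from itertools import combinations


def project_to_long_read_graph(anchors, min_shared=1):
    """Project a bipartite long-read/unitig graph onto the long reads.

    anchors: dict mapping a long-read id to the unitig ids found on it
             (hypothetical input format, chosen for this sketch).
    Returns a dict mapping each pair of long reads to the set of unitigs
    they share; a pair is kept only if it shares at least `min_shared`.
    """
    # Invert the bipartite graph: unitig -> long reads that contain it.
    reads_by_unitig = defaultdict(set)
    for read, unitigs in anchors.items():
        for unitig in unitigs:
            reads_by_unitig[unitig].add(read)

    # Two long reads become neighbours if they are anchored by a common unitig.
    shared = defaultdict(set)
    for unitig, reads in reads_by_unitig.items():
        for a, b in combinations(sorted(reads), 2):
            shared[(a, b)].add(unitig)

    return {pair: us for pair, us in shared.items() if len(us) >= min_shared}


# Toy data: r1..r3 form a chain through shared unitigs, r4 is isolated.
anchors = {
    "r1": ["u1", "u2"],
    "r2": ["u2", "u3"],
    "r3": ["u3", "u4"],
    "r4": ["u9"],
}
overlaps = project_to_long_read_graph(anchors)
for pair, unitigs in sorted(overlaps.items()):
    print(pair, sorted(unitigs))
```

In this toy input the resulting graph is a single path r1, r2, r3, which is the kind of path-like structure the subsequent graph processing aims to recover from real, noisier data.
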
Advances in RNA-seq have facilitated tremendous insights into the role of
both coding and non-coding transcripts. Yet, the complete and accurate
annotation of the transcriptomes of even model organisms has remained elusive.
RNA-seq produces reads significantly shorter than the average distance
between related splice events and presents high noise levels and other biases.
The computational reconstruction remains a critical bottleneck.
Ryūtō implements an extension of common splice graphs facilitating the integration
of reads spanning multiple splice sites and paired-end reads bridging distant
transcript parts. The decomposition of read coverage patterns is modeled as a
minimum-cost flow problem. Using phasing information from multi-splice and
paired-end reads, nodes with uncertain connections are decomposed step-wise
via Linear Programming.
Ryūtō's performance compares favorably with
state-of-the-art methods on both simulated and real-life datasets. Despite
ongoing research and our own contributions, progress on traditional
single-sample assembly has brought no major breakthrough. Multi-sample RNA-seq
experiments provide more information which, however, is challenging to utilize
due to the large number of accumulating errors. An extension to Ryūtō
enables the reconstruction of consensus transcriptomes from multiple RNA-seq
data sets, incorporating consensus calling at low level features. Benchmarks
show stable improvements with as few as three replicates.
Ryūtō outperforms competing approaches, providing a better and user-adjustable
sensitivity-precision trade-off. Ryūtō consistently improves assembly on
replicates, demonstrable also when mixing conditions or time series and for
differential expression analysis. Ryūtō's approach towards guided assembly is
equally unique. It allows users to adjust results based on the quality of the
guide, even for multi-sample assembly.

1 Preface
1.1 Assembly: A vast and fast evolving field
1.2 Structure of this Work
1.3 Available
2 Introduction
2.1 Mathematical Background
2.2 High-Throughput Sequencing
2.3 Assembly
2.4 Transcriptome Expression

3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly
3.1 Background
3.2 Strategy
3.3 Data preprocessing
3.4 Processing of the overlap graph
3.5 Post Processing of the Path Decomposition
3.6 Benchmarking
3.7 MuCHSALSA – Moving towards the future

4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly
4.1 Background
4.2 Strategy
4.3 The Ryūtō core algorithm
4.4 Improved Multi-sample transcript assembly with Ryūtō

5 Conclusion & Future Work
5.1 Discussion and Outlook
5.2 Summary and Conclusion

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:78099
Date: 18 February 2022
Creators: Gatter, Thomas
Contributors: Universität Leipzig
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: English
Detected Language: English
Type: info:eu-repo/semantics/updatedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text
Rights: info:eu-repo/semantics/openAccess