Thesis: Ph. D., Massachusetts Institute of Technology, Department of Mathematics, 2017. / Cataloged from PDF version of thesis. / Includes bibliographical references (pages 187-197). / Disparate biological datasets often exhibit similar well-defined structure; efficient algorithms can be designed to exploit this structure. In this doctoral thesis, we present a framework for similarity search based on entropy and fractal dimension; here, we prove that a clustered search algorithm scales in time with metric entropy (the number of covering hyperspheres) if the fractal dimension is low. Using these ideas, entropy-scaling versions of standard bioinformatics search tools can be designed, including for small-molecule, metagenomics, and protein structure search. This 'compressive acceleration' approach, which takes advantage of redundancy and sparsity in biological data, can also be leveraged for next-generation sequencing (NGS) read mapping. By pairing a clustered grouping over similar reads with a homology table for similarities in the human genome, our CORA framework can accelerate all-mapping by several orders of magnitude. We also present work on filtering empirical base-calling quality scores from NGS data. By using the sparsity of k-mers of sufficient length in the human genome and imposing a human prior through the use of frequent k-mers in a large corpus of human DNA reads, we are able to quickly discard over 90% of the information found in those quality scores while retaining or even improving downstream variant-calling accuracy. This filtering step allows for fast lossy compression of quality scores. / by Yun William Yu. / Ph. D.
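The clustered search idea summarized in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `build_clusters` and `clustered_search`, the greedy covering strategy, and the use of Euclidean distance are all assumptions chosen for illustration. The sketch covers the data with hyperspheres of a fixed radius, then answers a range query by first comparing the query against cluster centers (coarse step) and scanning only the clusters that the triangle inequality cannot rule out (fine step).

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Metric = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    # Assumed metric for this sketch; any metric obeying the triangle inequality works.
    return math.dist(a, b)


def build_clusters(points: Sequence[Point], cluster_radius: float,
                   dist: Metric = euclidean) -> Dict[Point, List[Point]]:
    """Greedy cover (hypothetical helper): each point joins the first center
    within cluster_radius, otherwise it becomes a new center."""
    clusters: Dict[Point, List[Point]] = {}
    for p in points:
        for center in clusters:
            if dist(p, center) <= cluster_radius:
                clusters[center].append(p)
                break
        else:
            clusters[p] = [p]
    return clusters


def clustered_search(query: Point, clusters: Dict[Point, List[Point]],
                     search_radius: float, cluster_radius: float,
                     dist: Metric = euclidean) -> List[Point]:
    """Return every stored point within search_radius of query."""
    hits: List[Point] = []
    for center, members in clusters.items():
        # Coarse step: by the triangle inequality, a cluster can only contain
        # a hit if its center lies within search_radius + cluster_radius.
        if dist(query, center) <= search_radius + cluster_radius:
            # Fine step: scan only the members of the surviving clusters.
            hits.extend(m for m in members if dist(query, m) <= search_radius)
    return hits


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (10.0, 0.0)]
clusters = build_clusters(points, cluster_radius=0.5)
result = clustered_search((5.1, 5.0), clusters, search_radius=0.15, cluster_radius=0.5)
```

In this sketch the coarse step costs one distance computation per cluster, so the work grows with the number of covering hyperspheres rather than with the number of stored points, which is the intuition behind the scaling claim in the abstract.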
Identifier | oai:union.ndltd.org:MIT/oai:dspace.mit.edu:1721.1/112879 |
Date | January 2017 |
Creators | Yu, Yun William |
Contributors | Bonnie Berger, Massachusetts Institute of Technology. Department of Mathematics. |
Publisher | Massachusetts Institute of Technology |
Source Sets | M.I.T. Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Thesis |
Format | 197 pages, application/pdf |
Rights | MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission., http://dspace.mit.edu/handle/1721.1/7582 |