Return to search

Computationally intensive methods for hidden Markov models with applications to statistical genetics

In most fields of technology and science, the exponential increase of available data is an apparent trend. In genetics, the main contributor to this trend is the improving efficiency of sequencing technologies. While the Human Genome project focused on assembling a single reference sequence not long ago, now there are aims to sequence million genomes in upcoming projects. The consequent computational challenge is being able to utilise this wealth of data, which requires the development of sufficiently powerful methods for analysis. However, the speed of transistor-based computing processors has recently hit a power ceiling and developers can no longer rely on hardware improvements automatically providing performance improvements in software directly. The result is that analysis methods are failing to keep up with the speed of data generation, and at this age of exponential data explosion it is becoming critical to find any solution for improving the performance of statistical methods. One traditional approach is to apply approximations - often trading the quality of results for response time. Another approach is to achieve algorithmic optimisations for existing methods without sacrificing results. Unfortunately, the possibilities for purely algorithmic optimisations often tend to be limited. A third approach is to attempt to harness the computational power of the presently re-emerging field of parallel computing. While the theoretical performance of parallel platforms roughly follows Moore's law, exploiting the power of parallelisms requires significant effort during development and may not even be possible in certain applications. This work attempts to explore avenues for achieving high performance for Hidden Markov Models (HMMs) and HMM applications in population genetics. The second chapter of this thesis introduces a single-locus variant of the IMPUTE2 method for calling and phasing genotype variants based on genotype likelihood data. This method uses both approximations and algorithmic optimisations and achieves performance improvements without a considerable drop in accuracy. It is also aimed to be highly parallelisable. The third chapter presents GPGPU-focused parallelisation methods over the statespace for HMM algorithms specifically under the Li and Stephens model, which is a widely and successfully used approximation of the coalescent. Practical experiments show ×200-×6000 times acceleration with a CUDA implementation of the popular Chromopainter method, which is based on the Li and Stephens model. The last chapter explores the theoretical possibility of parallelising HMM algorithms across blocks of observations (inspired by but not limited to methods used in genetics). A novel view and derivation is presented for block parallelism, along with accompanying analyses of applicability and relevance. Performance analysis results indicate that the application of block-parallelism is expected to be highly relevant for most large-scale HMM applications on present-day computing platforms, while block-parallelism may become a necessity for utilising the improving power of parallel hardware in the close future.

Identiferoai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:711745
Date January 2014
CreatorsKecskemetry, Peter D.
ContributorsHolmes, Chris
PublisherUniversity of Oxford
Source SetsEthos UK
Detected LanguageEnglish
TypeElectronic Thesis or Dissertation
Sourcehttps://ora.ox.ac.uk/objects/uuid:8dd5d68d-27e9-4412-868c-0477e438a2c5

Page generated in 0.0019 seconds