1 |
Algorithms For Haplotype Inference And Block Partitioning. Vijaya, Satya Ravi. 01 January 2006.
The completion of the Human Genome Project in 2003 paved the way for studies to better understand and catalog variation in the human genome. The International HapMap Project was started in 2002 with the aim of identifying genetic variation in the human genome and studying the distribution of genetic variation across populations of individuals. The information collected by the HapMap project will enable researchers to associate genetic variations with phenotypic variations.

Single Nucleotide Polymorphisms (SNPs) are loci in the genome where two individuals differ in a single base. It is estimated that there are approximately ten million SNPs in the human genome. These ten million SNPs are not completely independent of each other: blocks (contiguous regions) of neighboring SNPs on the same chromosome are inherited together. The pattern of SNPs on a block of the chromosome is called a haplotype. Each block might contain a large number of SNPs, but a small subset of these SNPs is sufficient to uniquely identify each haplotype in the block. The haplotype map, or HapMap, is a map of these haplotype blocks. Haplotypes, rather than individual SNP alleles, are expected to affect a disease phenotype.

The human genome is diploid, meaning that in each cell there are two copies of each chromosome; that is, each individual has two haplotypes in any region of the chromosome. With current technology, the cost of empirically collecting haplotype data is prohibitively high. Therefore, unordered bi-allelic genotype data is collected experimentally. The genotype data gives the two alleles at each SNP locus in an individual, but does not indicate which allele is on which copy of the chromosome. This necessitates computational techniques for inferring haplotypes from genotype data, a task known as the haplotype inference problem.

Many statistical approaches have been developed for the haplotype inference problem. Some of these statistical methods have been shown to be reasonably accurate on real genotype data, but they are very computation-intensive. With the International HapMap Project collecting information on nearly ten million SNPs, and with association studies involving thousands of individuals being undertaken, there is a need for more efficient methods for haplotype inference. This dissertation is an effort to develop efficient combinatorial algorithms for haplotype inference based on perfect phylogenies.

The perfect phylogeny haplotyping (PPH) problem is to derive a set of haplotypes for a given set of genotypes under the condition that the haplotypes describe a perfect phylogeny. The perfect phylogeny approach to haplotype inference is applicable to the human genome because of its block structure. An important contribution of this dissertation is an optimal O(nm)-time algorithm for the PPH problem, where n is the number of genotypes and m is the number of SNPs involved. The complexity of earlier algorithms for this problem was O(nm^2). The O(nm) complexity was achieved by applying transformations to the input data and by using the FlexTree data structure, developed as part of this dissertation work, which represents all possible PPH solutions for a given set of genotypes. Real genotype data does not always admit a perfect phylogeny, even within a block of the human genome.
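To make the inference task concrete, here is a minimal Python sketch (not taken from the dissertation) that enumerates the haplotype pairs consistent with a single unordered genotype; the 0/1/2 site encoding and the function name are illustrative assumptions. It shows why phasing is ambiguous: a genotype with h heterozygous sites admits 2^(h-1) possible haplotype pairs, and it is this ambiguity that perfect phylogeny based methods resolve across many genotypes at once.

from itertools import product

def consistent_haplotype_pairs(genotype):
    # genotype: list of SNP sites, encoded (by assumption) as
    # 0 = homozygous for one allele, 1 = homozygous for the other,
    # 2 = heterozygous (the two chromosome copies carry different alleles).
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het_sites, bits):
            h1[site], h2[site] = b, 1 - b
        # store the unordered pair so complementary assignments collapse
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs

# A genotype with two heterozygous sites has 2^(2-1) = 2 possible phasings.
print(consistent_haplotype_pairs([0, 2, 1, 2]))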
Therefore, it is necessary to extend the perfect phylogeny approach to accommodate deviations from perfect phylogeny. Such deviations might occur because of recombination events and repeated or back mutations (also referred to as homoplasy events). Another contribution of this dissertation is a set of fixed-parameter tractable algorithms for constructing near-perfect phylogenies with homoplasy events. For the problem of constructing a near-perfect phylogeny with q homoplasy events, the algorithm presented here takes O(nm^2 + m^(n+m)) time. Empirical analysis on simulated data shows that this algorithm produces more accurate results than PHASE (a popular haplotype inference program), while being approximately 1000 times faster than PHASE.

Another important problem when dealing with real genotype or haplotype data is the presence of missing entries. The Incomplete Perfect Phylogeny (IPP) problem is to construct a perfect phylogeny on a set of haplotypes with missing entries. The Incomplete Perfect Phylogeny Haplotyping (IPPH) problem is to construct a perfect phylogeny on a set of genotypes with missing entries. Both the IPP and IPPH problems have been shown to be NP-hard. Earlier approaches for both problems dealt with restricted versions in which the root is either available or can be trivially reconstructed from the data, or in which certain assumptions are made about the data. We make some novel observations about these problems and present efficient algorithms for their unrestricted versions. The algorithms have worst-case exponential time complexity, but have been shown to be very fast on practical instances of the problem.
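As a point of reference for how deviations from a perfect phylogeny are usually detected, the sketch below applies the standard four-gamete test to a 0/1 haplotype matrix, optionally skipping missing entries. It is a textbook compatibility check, not the dissertation's near-perfect phylogeny or IPP/IPPH algorithms: a pair of sites exhibiting all four gametes 00, 01, 10, 11 cannot be placed on a single (unrooted) perfect phylogeny and points to recombination or repeated/back mutation.

def incompatible_site_pairs(haplotypes, missing='?'):
    # haplotypes: list of equal-length sequences over {0, 1}, possibly with
    # entries equal to `missing`; those entries are simply skipped, which is
    # a conservative treatment of incomplete data, not the IPP/IPPH approach.
    m = len(haplotypes[0])
    bad = []
    for i in range(m):
        for j in range(i + 1, m):
            gametes = set()
            for h in haplotypes:
                if h[i] == missing or h[j] == missing:
                    continue
                gametes.add((h[i], h[j]))
            if len(gametes) == 4:          # all of 00, 01, 10, 11 observed
                bad.append((i, j))
    return bad

haps = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
print(incompatible_site_pairs(haps))       # [(0, 1)]: sites 0 and 1 conflict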
2 |
Haplotype Inference from Pedigree Data and Population Data. Li, Xin. January 2010.
No description available.
3 |
Associating genotype sequence properties to haplotype inference errors. Rosa, Rogério dos Santos. 12 March 2015.
Haplotype information has a central role in the understanding and diagnosis of certain illnesses, and also in evolutionary studies. Since that type of information is hard to obtain directly, computational methods to infer haplotypes from genotype data have received great attention from the computational biology community. Unfortunately, haplotype inference is a very hard computational biology problem, and existing methods can only partially identify correct solutions. I present neural network models that use different properties of the data to predict when a method is more prone to make errors. I construct models for three different haplotype inference approaches and show that the models are accurate and statistically relevant. The results of the experiments offer valuable insights into the performance of those methods, opening the opportunity for a combination of strategies or for improvement of individual approaches. I demonstrate that linkage disequilibrium (LD) and heterozygosity are very strong indicators of switch-error tendency for the four methods studied, and I delineate scenarios, based on LD measures, that reveal a higher or lower propensity of the haplotype inference methods to present errors, since the correlation between LD and the occurrence of errors varies among regions along the genotypes. I present evidence that considering windows of length 10 immediately to the left of a SNP (the upstream region), and eliminating non-informative SNPs through Fisher's test, leads to a more suitable correlation between LD and inference errors. I apply multiple linear regression to explore the relevance of several biologically meaningful properties of the genotype sequences for the accuracy of haplotype inference results, developing models for two databases (considering only humans) and using two error metrics. The accuracy of the results and the stability of the proposed models are supported by statistical evidence.
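The sketch below illustrates the kind of predictors the abstract refers to: pairwise LD (r^2) between a SNP and the SNPs in the window immediately upstream, plus the SNP's expected heterozygosity. The formulas are the standard definitions computed from phased 0/1 haplotypes; the exact feature construction, the Fisher's-test filtering, and the neural network and regression models of the thesis are not reproduced here, and the function names are illustrative.

def r_squared(haps, i, j):
    # Standard pairwise LD from phased 0/1 haplotypes:
    # D = p_ij - p_i * p_j, r^2 = D^2 / (p_i (1-p_i) p_j (1-p_j)).
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n
    p_j = sum(h[j] for h in haps) / n
    p_ij = sum(1 for h in haps if h[i] == 1 and h[j] == 1) / n
    d = p_ij - p_i * p_j
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return d * d / denom if denom > 0 else 0.0

def upstream_features(haps, snp, window=10):
    # Mean r^2 over the `window` SNPs immediately upstream of `snp`, plus the
    # SNP's expected heterozygosity 2p(1-p); illustrative feature choices only.
    lo = max(0, snp - window)
    lds = [r_squared(haps, k, snp) for k in range(lo, snp)]
    p = sum(h[snp] for h in haps) / len(haps)
    return {"mean_upstream_r2": sum(lds) / len(lds) if lds else 0.0,
            "heterozygosity": 2 * p * (1 - p)}

haps = [(0, 0, 1, 1), (0, 1, 1, 0), (1, 0, 0, 1), (1, 1, 0, 0)]
print(upstream_features(haps, snp=3, window=10))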
4 |
Two Optimization Problems in Genetics: Multi-dimensional QTL Analysis and Haplotype Inference. Nettelblad, Carl. January 2012.
The existence of new technologies, implemented in efficient platforms and workflows, has made massive genotyping available to all fields of biology and medicine. Genetic analyses are no longer dominated by experimental work in laboratories, but rather by the interpretation of the resulting data. When billions of data points representing thousands of individuals are available, efficient computational tools are required. The focus of this thesis is on developing models, methods and implementations for such tools.

The first theme of the thesis is multi-dimensional scans for quantitative trait loci (QTL) in experimental crosses. By mating individuals from different lines, it is possible to gather data that can be used to pinpoint the genetic variation influencing specific traits to specific genome loci. However, it is natural to expect multiple genes influencing a single trait to interact. The thesis discusses model structure and model selection, giving new insight into the conditions under which orthogonal models can be devised. The thesis also presents a new optimization method for efficiently and accurately locating QTL, and for performing the permuted-data searches needed for significance testing. This method has been implemented in a software package that can seamlessly perform the searches on grid computing infrastructures.

The other theme in the thesis is the development of adapted optimization schemes for using hidden Markov models to trace allele inheritance pathways, and specifically to infer haplotypes. The advances presented form the basis for more accurate and unbiased line-origin probabilities in experimental crosses, especially multi-generational ones. We show that the new tools are able to reconstruct haplotypes, and even genotypes, in founder individuals and offspring alike, based only on unordered offspring genotypes. The tools can also handle larger populations than competing methods, resolving inheritance pathways and phase in much larger and more complex populations. Finally, the methods presented are also applicable to datasets where individual relationships are not known, which is frequently the case in human genetics studies. One immediate application would be improved accuracy of imputation of SNP markers within genome-wide association studies (GWAS).
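As context for the HMM-based phasing theme, the sketch below is the standard forward recursion for a hidden Markov model, which is the core likelihood computation that inheritance-tracing and phasing methods build on. The two-state model, the transition and emission numbers, and the genotype observation codes are toy assumptions, not the models developed in the thesis.

def hmm_forward(init, trans, emit, observations):
    # Standard forward recursion (no log-scaling, for clarity, so it may
    # underflow on long marker sequences). Returns the likelihood of the
    # observation sequence given the model. In HMM-based phasing, the hidden
    # states typically encode phase or inheritance patterns at each marker.
    n_states = len(init)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emit[s][obs] * sum(alpha[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)

# Toy example: two hidden phase states, observations are genotype codes 0/1/2.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
print(hmm_forward(init, trans, emit, [0, 0, 2, 1]))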