08 February 2008
Cheaper and more rapid DNA sequencing has led to the accumulation of large amounts of genetic data and has fueled the development of new methods to analyze these data. Using population genetics theory and computational methods, we can explore the evolutionary forces that shape genetic variation within and among populations of humans and malaria parasites. Demographic events such as population size change influence current patterns of genetic variation. Accounting for the demographic history of a population is critical in the interpretation of population genetic analyses, particularly in detecting regions under selection and in making inferences about linkage disequilibrium. Characterizing how recombination rates evolve is critical for the efficient design of association studies and, in turn, for understanding the genetics behind complex phenotypes. In malaria parasites, recombination is a key element in the creation of a wide array of antigens, which help invade host cells. We examine patterns of genetic variation in humans and malaria parasites and explore how demographic history and recombination rates affect these patterns.
30 March 2008
The International HapMap Project and high-throughput genotyping technology have generated data on millions of genome-wide markers that can be used in genetic studies. Each marker can be analyzed separately, but analyzing multiple markers simultaneously through haplotypes has recently generated great interest. Understanding the haplotype structure of the human genome may provide important information on human evolutionary history and help identify the genetic variants responsible for complex human diseases. Since the alleles at closely linked markers on a single chromosome are often statistically dependent (i.e., in linkage disequilibrium (LD)), one crucial aspect of haplotype analysis is to characterize LD patterns in different regions and different populations. To assess the extent of correlation of genetic variation at multiple markers in a given region and population, pairwise LD measures have been commonly used. However, pairwise LD measures alone may not capture the variability of background levels of disequilibrium effectively, since multilocus LD measures can provide information about simultaneous allele associations among multiple loci that pairwise LD measures miss. In addition, to fully characterize the haplotype structure and LD pattern at multiple markers, it is necessary to consider higher-order disequilibria and estimate their values.
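As a rough illustration of what a pairwise LD measure computes, the sketch below derives the disequilibrium coefficient D, its normalized form D', and the squared correlation r^2 for two biallelic loci from haplotype and allele frequencies. These three measures are standard in the literature, but their choice here is an assumption, since the abstract's own measure names did not survive extraction. The function name `pairwise_ld` and its parameters are hypothetical.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A at locus 1
           and allele B at locus 2
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    """
    # D is the deviation of the observed haplotype frequency from the
    # value expected if the two loci were independent.
    d = p_ab - p_a * p_b

    # D' rescales D by the largest magnitude it could take given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two allele indicators.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    print(pairwise_ld(0.4, 0.5, 0.4))
```

Because each call looks at only two loci, a statistic of this kind cannot by itself describe associations that involve three or more loci at once, which is the limitation the abstract raises.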
13 April 2009
Transcription factors (TFs) have been characterized as mediators of human complex disease processes. The target genes of TFs may also be associated with disease. Identification of potential TF targets could further our understanding of the gene-gene interactions underlying complex disease. We focused on two TFs, USF1 and ZNF217, because of their biological importance, especially their known genetic association with coronary artery disease (CAD), and the availability of chromatin immunoprecipitation microarray (ChIP-chip) results. First, we used USF1 ChIP-chip data as a training dataset to develop and evaluate several kernel logistic regression prediction models. Our most accurate predictor significantly outperformed standard PWM-based prediction methods. This novel prediction method enables a more accurate and efficient genome-scale identification of USF1 binding and associated target genes. Second, the results from independent linkage and gene expression studies suggest that ZNF217 may also be a candidate gene for CAD. We further investigated the role of ZNF217 in CAD in three independent CAD samples with different phenotypes. Our association studies of ZNF217 identified three SNPs with consistent association with CAD across the three samples. Aorta expression profiling indicated that the proportion of the aorta with raised lesions was also positively correlated with ZNF217 expression. The combined evidence suggests that ZNF217 is a novel susceptibility gene for CAD. Finally, we applied our previously developed TF binding site (TFBS) prediction method to ZNF217. The performance of the prediction models for ZNF217 and USF1 is very similar. We demonstrated that our TFBS prediction method can be extended to other TFs.
In summary, the results of this dissertation research are (1) evaluation of two TFs, USF1 and ZNF217, as susceptibility factors for CAD; (2) development of a generalized method for TFBS prediction; and (3) prediction of TFBSs and target genes of two TFs, and identification of SNPs within TFBSs. This research allows for the development of study designs to assess TF-based interactions in genetic susceptibility to human complex disease.
Mannino, Frank Vincent
28 April 2006
The ability to realistically model gene evolution improved dramatically with the rejection of the assumption that rates are constant across sites. Rate heterogeneity models allow for better estimates of parameters and site-specific inferences such as the detection of positive selection. Recently developed models of codon evolution allow for both synonymous and nonsynonymous rates to vary independently according to discretized gamma distributions. I applied this model to mitochondrial genomes and concluded that synonymous rate variation is present in many genes, and is of appreciable magnitude relative to the amount of nonsynonymous heterogeneity. I then extended this model to allow for the two rates to vary according to a dependent bivariate distribution, permitting tests for the significance of correlation of rates within a gene. I present here the algorithm to discretize this bivariate distribution and the application of the model to many real data sets. Significant correlation between synonymous and nonsynonymous rates exists in roughly half of the data sets that I examined, and the correlation is typically positive. These data sets range over a wide group of taxa and genes, implying that the trend of correlation is general. Finally, I performed a thorough investigation of the statistical properties of using discretized gamma distributions to model rate variation, looking at the bias and variance in parameter estimates. These discretized distributions are common in modeling heterogeneity, but have weaknesses that must be well understood before making inferences.
Keebler, Jonathan Edward Myers
20 April 2010
Recent technological advances have made high-throughput DNA sequencing a routine laboratory experiment. This progression in technology has been made possible by the parallel production of millions of short fragments of sequence. The responsibility of garnering biological information from these DNA fragments has shifted from the wet-lab to the bioinformatician. As sequencing technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes, a task that is not necessarily trivial using high-throughput sequencing reads. A violation of Mendelian inheritance laws observed amid the resequenced genomes of family members can indicate the presence of a de novo mutation. A method for locating de novo mutations by probabilistically inferring genotypes across a pedigree using high-throughput sequencing is presented and applied to two resequenced nuclear families: one as a collaborative effort within The 1,000 Genomes Project, and the second in an attempt to discover candidate driver and passenger mutations within the genome of an Acute Lymphoblastic Leukemia. The mutation findings within these projects are presented, and the approach is examined in detail, highlighting areas where method improvements may be made. Considering the challenges experienced in these studies within the larger context of the nascent field of Personal Genomics, an honest assessment is presented of developments that must be made before the application of whole-genome sequencing on the scale of an individual human can unequivocally be used to predict, diagnose, or treat human disease.
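The idea that a Mendelian violation in a family flags a candidate de novo mutation can be sketched with a simple trio check. This is a deliberately simplified illustration that assumes hard genotype calls at a single diploid site. The method described in the abstract infers genotypes probabilistically from sequencing reads, which this sketch does not attempt. The function name `mendelian_consistent` and the tuple-of-alleles representation are hypothetical.

```python
from itertools import product


def mendelian_consistent(child, mother, father):
    """Return True if the child's genotype can be formed by taking one
    allele from the mother and one from the father.

    Each genotype is a 2-tuple of alleles, e.g. ("A", "G").
    Assumes an autosomal, diploid site with hard genotype calls.
    """
    child_sorted = tuple(sorted(child))
    # Try every (maternal allele, paternal allele) transmission.
    for m_allele, f_allele in product(mother, father):
        if tuple(sorted((m_allele, f_allele))) == child_sorted:
            return True
    return False


if __name__ == "__main__":
    # Both parents are homozygous A/A, so a G in the child is a
    # Mendelian violation and a candidate de novo site.
    print(mendelian_consistent(("A", "G"), ("A", "A"), ("A", "A")))
```

With real sequencing data, genotyping errors in any family member produce the same signal as a true de novo mutation, which is one reason a probabilistic treatment of genotypes across the pedigree is attractive.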
Strain, Errol Alan
07 August 2006
The current dissertation looks at the molecular evolution of protein-coding genes in the flowering plant Arabidopsis thaliana and within two RNA viruses, human immunodeficiency virus (HIV) and Astroviridae. We analyzed members of the receptor-like kinase (RLK) gene family in Arabidopsis thaliana for positive selection. Likelihood analysis found evidence for positive selection in 12 of the 52 RLK family sequence groups. These 12 groups represent 97 of the 403 sequences analyzed. The majority of genes in groups subject to positive selection have not been functionally characterized, but sites under selection are predominantly located in the extracellular region. In HIV, we use Akaike Information Criterion (AIC)-based model averaging over models of nucleotide evolution to examine estimates of genetic distance and the transition/transversion (ts/tv) ratio. AIC-weighted estimates of distance and ts/tv were shown to be robust relative to model assumptions. AIC-weighted estimates of the ts/tv ratio in simulated HIV sequences generally had less variance than similar estimates made by selecting the single best-scoring AIC model. Astroviruses are a leading cause of viral gastroenteritis in infants worldwide, and little is known about the mechanisms of astrovirus-induced diarrhea or the virally encoded components responsible for disease. We report the genomic sequence of nine novel TAstV-2 isolates. Nucleotide and amino acid identities for the isolates were generally > 90%. Phylogenies constructed using genomic RNA and the individual open reading frames (ORFs) provide evidence for recombination and indicate differences in substitution rates between non-structural and structural genes. Analysis of the viral capsid genes using codon models of evolution indicates site-specific positive selection in both turkey and human astroviruses.
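AIC-based model averaging, as mentioned above for the HIV analyses, can be sketched with the usual Akaike weights: each model's weight is proportional to exp(-0.5 * (AIC - lowest AIC)), and the averaged estimate is the weighted sum of the per-model estimates. The sketch below is a generic illustration under that standard definition, not the dissertation's implementation. The function names and the example numbers are hypothetical.

```python
import math


def akaike_weights(aic_scores):
    """Convert a list of AIC scores into Akaike weights that sum to 1."""
    best = min(aic_scores)
    # Relative likelihood of each model given the best-scoring one.
    rel = [math.exp(-0.5 * (a - best)) for a in aic_scores]
    total = sum(rel)
    return [r / total for r in rel]


def model_averaged(estimates, aic_scores):
    """Weight each model's parameter estimate by its Akaike weight."""
    weights = akaike_weights(aic_scores)
    return sum(w * e for w, e in zip(weights, estimates))


if __name__ == "__main__":
    # Hypothetical ts/tv estimates from three substitution models.
    tstv = [2.1, 2.4, 2.3]
    aic = [1002.0, 1000.0, 1000.5]
    print(akaike_weights(aic))
    print(model_averaged(tstv, aic))
```

Averaging in this way lets models with nearly equal support all contribute to the estimate instead of discarding all but the single best-scoring one, which is consistent with the lower variance the abstract reports for the weighted estimates.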
04 August 2008
One of the most important tasks in human genetics is to search for disease susceptibility genes. Linkage and association analyses are two major approaches for disease-gene mapping. Chapter 1 reviewed the development of disease-gene mapping methods in the past decades. Gene mapping of complex human diseases often results in the identification of multiple potential risk variants within a gene and/or in the identification of multiple genes within a linkage peak. Thus a question of interest is whether the linkage result can be explained in part or in full by a candidate SNP that shows evidence of association; the answer provides some guidance for the next, time-consuming step of positional cloning of susceptibility genes. Two methods, GIST and LAMP, which assess whether the SNP can partially or fully account for the linkage signal in the region identified by a linkage scan, are evaluated on Genetic Analysis Workshop 15 (GAW15) simulated rheumatoid arthritis (RA) data and discussed in Chapter 2. The simulation results showed that GIST is simple and works slightly better than the LAMP-LE test when there is little linkage evidence; the LAMP linkage test has limited power when there is not much linkage evidence; and the LAMP association test performs best not only when the linkage evidence is extremely high, but also when there is some LD between the candidate SNP and the trait locus. The fact that complex traits are often determined by multiple genetic and environmental factors with small-to-moderate effects makes it important to investigate the behavior of current association methods under models with multiple risk variants. In Chapter 3, we compared APL, FBAT, LAMP, APL-Haplotype, FBAT-LC and the APL-OSA conditional test in five multiple-risk-variant models.
The simulation results showed that the power of single-marker association tests is closely correlated with the amount of LD between marker and disease loci, and that these tests maintain good power to detect multiple risk variants in a small region with a moderate degree of LD for fully genotyped families. Global tests, such as FBAT-LC, are sensitive to the presence of at least one susceptibility variant, but are not helpful for selecting the most promising SNPs for further study. We reported that if multiple haplotypes are associated with different disease loci, the haplotype test results can be misleading, while the APL-OSA conditional test has the greatest power to properly dissect the clustered associated markers for all models, with an acceptable type I error rate ranging from 0.033 to 0.056. We applied the APL-OSA conditional test to GENECARD samples and obtained reasonable results. One linkage region of particular interest on chromosome 3 was identified by two independent genome linkage scans for Coronary Artery Disease (CAD). Multiple disease susceptibility genes have been reported from this region, and there is also linkage evidence that this region may harbor a gene or genes determining HDL-C levels. Within this region, a search for HDL-C QTL and analyses of the relationships among genetic variants, HDL-C level, and CAD risk are discussed in Chapter 4. We performed CAD association and HDL-C QTL analysis on two independent datasets. We identified SNP rs2979307 in the OSBPL11 gene, whose association survives a Bonferroni correction. We observed different HDL-C trends with HDL-C associated SNPs. Even with the evident heterogeneity present in our CAD population, we detected several association signals for HDL-C with SNPs in the KALRN, MYLK, CDGAP and PAK2 genes in both CAD datasets; all of these genes belong to a Rho pathway.
11 August 2009
Alternative splicing (AS) is an important post-transcriptional mechanism that increases protein diversity and may affect mRNA stability and translation efficiency. Despite its importance, our knowledge about its mechanism and regulation is very limited. Although it is known that the regulation of AS is influenced by multiple factors, most previous studies have focused on analyzing an individual regulator. In this dissertation, we apply three types of association rule mining techniques to discover cis-regulatory motifs or motif groups that are associated with specific AS patterns in mouse. General association rule mining for categorical attributes is used to find "motif => motif" rules in gene groups that show similar exon skipping patterns. This method provides candidates for interacting motifs. Discretization-based and distribution-based quantitative association rule mining techniques are used to find "motif => exon skipping profile" rules. Many of the discovered motif candidates coincide with known splicing factor binding sites. Our ultimate goal is to find motifs and motif combinations that are involved in the dynamic regulation of AS. Based on our observations, we hypothesize that some cis-regulatory elements affect AS only in combination with other elements. Interacting motifs show interesting differences from motifs that act individually. For example, interacting motif pairs are more conserved and occur on average closer to the splice sites, and motif pairs derived from distribution-based association rule mining also occur in higher multiplicity. Based on these observations, we hypothesize that interacting cis-regulatory motifs might often correspond to weaker binding sites that occur in clusters close to the regulated splice sites.
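A "motif => motif" rule of the kind described above is typically scored by its support and confidence. The sketch below computes both for a rule over a collection of motif sets, one set per gene or exon. It illustrates only the general categorical case under the textbook definitions, and the function name `rule_stats`, the motif labels, and the toy data are all hypothetical.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    transactions : iterable of collections of motif names, one per gene/exon
    antecedent   : collection of motifs on the left-hand side of the rule
    consequent   : collection of motifs on the right-hand side of the rule
    """
    antecedent, consequent = set(antecedent), set(consequent)
    transactions = [set(t) for t in transactions]
    n = len(transactions)
    # Number of transactions containing the left-hand side.
    n_ante = sum(1 for t in transactions if antecedent <= t)
    # Number of transactions containing both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n if n else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


if __name__ == "__main__":
    # Toy data: motifs found near the skipped exon of four genes.
    genes = [{"m1", "m2"}, {"m1"}, {"m1", "m2", "m3"}, {"m3"}]
    print(rule_stats(genes, {"m1"}, {"m2"}))
```

Rules with high support and confidence within a group of genes that share an exon skipping pattern would be candidates for interacting motifs; the quantitative variants mentioned in the abstract replace the categorical right-hand side with an exon skipping profile.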
18 July 2003
Genetic analysis of molecular markers has allowed biologists to ask a wide variety of questions. This dissertation explores some of the statistical and computational issues that arise in genetic marker data analysis. Chapter 1 gives an introduction to genetic marker data, as well as a brief description of each chapter. Chapter 2 presents the different genetic analyses performed on a large data set and discusses the use of microsatellites to describe the maize germplasm and to improve maize germplasm maintenance. Considerable attention is focused on how the maize germplasm is organized and how genetic variation is distributed. A novel maximum likelihood method is developed to estimate the historical contributions for maize inbred lines. Chapter 3 covers a new method for optimal selection of a core set of lines from a large germplasm collection. The simulated annealing algorithm for choosing an optimal k-subset is described and evaluated using the maize germplasm as an example; general constraints are incorporated in the algorithm, and the efficiency of the algorithm is compared to existing methods. Chapter 4 covers a two-stage strategy to partition a chromosomal region into blocks with extensive within-block linkage disequilibrium, and to select the optimal subset of SNPs that essentially captures the haplotype variation within a block. Population simulations suggest that the recursive bisection algorithm for block partitioning is generally reliable for identifying recombination hotspots. Maximal entropy theory is applied to choose an optimal subset of SNPs. The procedures are evaluated analytically as well as by simulation. The final chapter covers a new software package for genetic marker data analysis. The methods implemented in the package are listed. A brief tutorial is included to illustrate the features of the package.
Chapter 5 also describes a new method for estimating population specific F-statistics and an extended algorithm for estimating haplotype frequencies.
21 August 2008
Disease gene fine mapping is an important task in human genetic research. Association analysis is becoming a primary approach for localizing disease loci, especially now that abundant SNPs are available owing to improvements in genotyping technology over the last decades. Despite the rapid improvement in detection ability, the association strategy still has many limitations. In this dissertation, we focused on three topics: a haplotype-similarity-based test, an association test incorporating genotyping error, and a simulation tool for large data sets. 1) Previous haplotype-similarity-based tests cannot incorporate covariates. In chapter 2, we proposed a new association method based on haplotype similarity that incorporates covariates and makes maximal use of the information in the data. We found that our method improves power when neither LD nor allele frequency is too low and is comparable under other scenarios. 2) In chapter 3, we proposed a new strategy that incorporates genotyping uncertainty to assess the association between traits and SNPs. Extensive simulation studies for case-control designs demonstrated that an association test based on intensity information can reduce the impact of genotyping error. 3) In chapter 4, we described simulation software, SimuGeno, which is used to simulate large-scale genomic data for case-control association studies.