11 August 2009
Alternative splicing (AS) is an important post-transcriptional mechanism that increases protein diversity and may affect mRNA stability and translation efficiency. Despite its importance, our knowledge about its mechanism and regulation is very limited. Although it is known that the regulation of AS is influenced by multiple factors, most previous studies have focused on analyzing an individual regulator. In this dissertation, we apply three types of association rule mining techniques to discover cis-regulatory motifs or motif groups that are associated with specific AS patterns in mouse. General association rule mining for categorical attributes is used to find "motif => motif" rules in gene groups that show similar exon skipping patterns. This method provides candidates for interacting motifs. Discretization-based and distribution-based quantitative association rule mining techniques are used to find "motif => exon skipping profile" rules. Many of the discovered motif candidates coincide with known splicing factor binding sites. Our ultimate goal is to find motifs and motif combinations that are involved in the dynamic regulation of AS. Based on our observations we hypothesize that some cis-regulatory elements affect AS only in combination with other elements. Interacting motifs show interesting differences from motifs that act individually. For example, interacting motif pairs are more conserved and occur on average closer to the splice sites, and motif pairs derived from distribution-based association rule mining also occur in higher multiplicity. Based on these observations, we hypothesize that interacting cis-regulatory motifs might often correspond to weaker binding sites that occur in clusters close to the regulated splice sites.
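To make the "motif => motif" rules concrete, the following is a minimal sketch of how support and confidence are computed for one candidate rule over a group of genes. The gene names, motif labels, and the `gene_motifs` table are invented for illustration and do not come from the dissertation, whose actual mining procedure and thresholds may differ.

```python
from fractions import Fraction

# Hypothetical data: motifs observed near the skipped exon of each gene
# in one group of genes with similar exon skipping patterns.
gene_motifs = {
    "geneA": {"M1", "M2", "M3"},
    "geneB": {"M1", "M2"},
    "geneC": {"M1"},
    "geneD": {"M2", "M3"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(antecedent | consequent, transactions) / base


# Candidate rule "M1 => M2"
rule_support = support({"M1", "M2"}, gene_motifs)
rule_confidence = confidence({"M1"}, {"M2"}, gene_motifs)
```

A rule would normally be kept as a candidate only if both values exceed user-chosen minimums; those cutoffs are not stated in the abstract.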
18 July 2003
Genetic analysis of molecular markers has allowed biologists to ask a wide variety of questions. This dissertation explores some of the statistical and computational issues that arise in genetic marker data analysis. Chapter 1 gives an introduction to genetic marker data, as well as a brief description of each chapter. Chapter 2 presents the different genetic analyses performed on a large data set and discusses the use of microsatellites to describe the maize germplasm and to improve maize germplasm maintenance. Considerable attention is focused on how the maize germplasm is organized and how genetic variation is distributed. A novel maximum likelihood method is developed to estimate the historical contributions for maize inbred lines. Chapter 3 covers a new method for optimal selection of a core set of lines from a large germplasm collection. The simulated annealing algorithm for choosing an optimal k-subset is described and evaluated using the maize germplasm as an example; general constraints are incorporated in the algorithm, and the efficiency of the algorithm is compared to existing methods. Chapter 4 covers a two-stage strategy to partition a chromosomal region into blocks with extensive within-block linkage disequilibrium, and to select the optimal subset of SNPs that essentially captures the haplotype variation within a block. Population simulations suggest that the recursive bisection algorithm for block partitioning is generally reliable for identifying recombination hotspots. Maximal entropy theory is applied to choose an optimal subset of SNPs. The procedures are evaluated analytically as well as by simulation. The final chapter covers a new software package for genetic marker data analysis. The methods implemented in the package are listed. A brief tutorial is included to illustrate the features of the package.
Chapter 5 also describes a new method for estimating population specific F-statistics and an extended algorithm for estimating haplotype frequencies.
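The simulated annealing search for a k-subset mentioned for Chapter 3 can be sketched as follows. This is an illustration only: the objective used here (number of distinct alleles captured by the chosen lines), the swap move, the cooling schedule, and the toy data are all assumptions, not the dissertation's actual criterion or parameters.

```python
import math
import random


def allele_count(subset, lines):
    """Assumed objective: number of distinct alleles covered by the subset."""
    covered = set()
    for i in subset:
        covered |= lines[i]
    return len(covered)


def anneal_k_subset(lines, k, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for a k-subset of line indices that maximizes allele_count."""
    n = len(lines)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of lines")
    rng = random.Random(seed)
    current = set(rng.sample(range(n), k))
    cur_score = allele_count(current, lines)
    best, best_score = set(current), cur_score
    temp = t0
    for _ in range(steps):
        # Move: swap one chosen line for one line outside the subset.
        out_line = rng.choice(sorted(current))
        in_line = rng.choice(sorted(set(range(n)) - current))
        candidate = (current - {out_line}) | {in_line}
        cand_score = allele_count(candidate, lines)
        delta = cand_score - cur_score
        # Always accept improvements; accept worse moves with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_score = candidate, cand_score
            if cur_score > best_score:
                best, best_score = set(current), cur_score
        temp *= cooling
    return best, best_score


# Hypothetical germplasm: each line is the set of alleles it carries.
lines = [
    {"a1", "b1"},
    {"a1", "b2"},
    {"a2", "b1"},
    {"a2", "b2", "c1"},
    {"a3"},
    {"a1", "b1", "c1"},
]
best, best_score = anneal_k_subset(lines, k=2)
```

Constraints such as lines that must always be included could be handled by excluding them from the swap move; how the dissertation incorporates its general constraints is not described in the abstract.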
21 August 2008
Disease gene fine mapping is an important task in human genetic research. Association analysis is becoming a primary approach for localizing disease loci, especially now that abundant SNPs are available thanks to improvements in genotyping technology over the last decades. Despite the rapid improvement in detection ability, the association strategy has many limitations. In this dissertation, we focused on three topics: a haplotype similarity based test, an association test incorporating genotyping error, and a simulation tool for large data sets. 1) Previous haplotype similarity based tests cannot incorporate covariates. In chapter 2, we proposed a new association method based on haplotype similarity that incorporates covariates and makes maximal use of the information in the data. We found that our method improves power when neither LD nor allele frequency is too low and is comparable under other scenarios. 2) In chapter 3, we proposed a new strategy that incorporates genotyping uncertainty to assess the association between traits and SNPs. Extensive simulation studies for case-control designs demonstrated that the intensity information based association test can reduce the impact of genotyping error. 3) In chapter 4, we described simulation software, SimuGeno, which is used to simulate large scale genomic data for case-control association studies.
Starmer, Joshua Mr.
02 November 2006
Molecular biologists have been observing interactions between messenger RNA (mRNA) molecules and other non-coding RNA molecules for quite some time. Here I revisit some of the classical hybridizations between the 16S ribosomal RNA (rRNA) and mRNA during initiation, as well as investigate the interactions between small interfering RNA (siRNA) molecules and mRNA. In reviewing rRNA-mRNA interactions, I observed that the majority of both bacterial and eukaryotic genes can bind at the start codon. This novel result led to a method for improving genome annotation as well as a new theory of translation initiation. The examination of siRNA-mRNA interactions led to new criteria for predicting an siRNA's efficacy.
30 November 2004
Disease gene mapping is one of the main focuses of genetic epidemiology and statistical genetics. This dissertation explores some methods and algorithms in this area, especially in pedigrees. The first chapter gives an introduction to human genetics and disease gene mapping. Existing linkage and association methods are introduced and compared. Probabilities of genotypic data from multiple linked marker loci on related individuals are used as likelihoods of gene locations for gene-mapping, or as likelihoods of other parameters of interest in human genetics. With the recent development in genetics and molecular biology techniques, large-scale marker data has become available, which requires highly efficient likelihood calculations especially for complex pedigrees. Algorithms for likelihood calculations for pedigree data are reviewed in chapter 2. Besides exact likelihood calculation methods and MCMC, a Sequential Importance Sampling (SIS) approach has been proposed to enable calculations for large pedigrees with large numbers of markers. However, when the system gets large, the variance of the importance sampling weights increases while both efficiency and accuracy of the method decrease. We propose an optimization algorithm for calculating the likelihood of general pedigrees in Chapter 3. We incorporate a resampling strategy into SIS to reduce the variance inflation problem. A successful linkage analysis may identify a linkage region of interest containing hundreds of genes at a magnitude of perhaps ten to thirty centiMorgans. A follow-up association (or so-called linkage disequilibrium) analysis can provide much finer gene-mapping but is subject to greater multiple testing problems. In Chapter 4, we present a method for determining whether an association result is responsible for a non-parametric linkage result for binary traits in general pedigrees. 
The correlation between family frequency of a variant of interest and family LOD score is used as a measure of whether the association between a given variant at a marker and the disease status can help to explain a significant linkage result seen in the collection of families in the region around the marker.
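The idea of adding resampling to Sequential Importance Sampling to control weight variance can be illustrated with a generic sketch. It assumes a common scheme (multinomial resampling triggered when the effective sample size falls below a threshold); the dissertation's specific resampling strategy is not given in the abstract, and the particle labels and weights below are hypothetical.

```python
import random


def effective_sample_size(weights):
    """ESS = 1 / sum of squared normalized weights; equals n for equal weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return 1.0 / sum((w / total) ** 2 for w in weights)


def resample_if_degenerate(particles, weights, threshold=0.5, rng=None):
    """Resample in proportion to weight when ESS < threshold * n.

    After resampling, every particle gets the mean of the old weights,
    so the average weight (the running likelihood estimate) is preserved.
    """
    rng = rng or random.Random(0)
    n = len(particles)
    if effective_sample_size(weights) >= threshold * n:
        return list(particles), list(weights)
    mean_weight = sum(weights) / n
    chosen = rng.choices(particles, weights=weights, k=n)
    return chosen, [mean_weight] * n


# Hypothetical partial samples (e.g. sampled genotype configurations).
particles = ["h1", "h2", "h3", "h4"]
weights = [0.97, 0.01, 0.01, 0.01]  # one weight dominates, so ESS is low
new_particles, new_weights = resample_if_degenerate(particles, weights)
```

In a full SIS run this check would be applied after each locus (or each sampling step) is added, before the weights are updated again.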
05 November 2007
Genetic association studies aim to detect association between one or more genetic polymorphisms and complex traits, which might be some quantitative characteristic or a qualitative attribute of disease. In Chapter 1, we introduce the development of methods for association mapping in the past decades and present the rationale behind our X-linked method development. Family-based association methods have been well developed for autosomes, but unique features of X-linked markers have received little attention. In Chapter 2, we propose a likelihood approach (X-LRT) to estimate genetic risks and test association using a case-parents design. The method uses nuclear families with a single affected proband, and allows additional siblings and missing parental genotypes. We also extend X-LRT from a single-marker test to a multiple-marker haplotype analysis. Our X-LRT offers great flexibility for testing different penetrance relationships within and between sexes. In addition, estimation of relative risks provides a measure of the magnitude of X-linked genetic effects on complex disorders. In Chapters 3 and 4, we fill the methodological gaps by developing two approaches (X-QTL and X-HQTL) to test association between X-linked marker alleles/haplotypes and quantitative traits in a nuclear family design. We adopt the orthogonal decomposition, which provides consistent estimates of the additive genetic values of marker alleles/haplotypes. Joint estimation of the linkage variance component in the association model reduces type I errors to nominal expectations. Dosage compensation models provide a simple relationship of X-linked additive effects between sexes. In Chapters 2, 3, and 4, our simulation results demonstrate the validity and substantially higher power of our approaches compared with other existing programs. We also apply our methods to MAOA and MAOB candidate-gene studies of family data with Parkinson disease.
In Chapter 5, we discuss some issues relevant to the design and execution of our X-linked family-based association studies.
STATISTICAL METHODS FOR FAMILY-BASED ASSOCIATION STUDIES FOR COMPLEX HUMAN DISEASES: SINGLE-LOCUS AND HAPLOTYPE METHODS
Chung, Ren-Hua
15 December 2006
Disease-gene fine-mapping is an important task in human genetics. Linkage and association analyses are the two main approaches for exploring disease susceptibility genes. In Chapter 1, we introduce the development of methods for disease-gene mapping in the past decades and present the rationale behind our new method development. Family-based association analyses have provided powerful tools for disease-gene mapping. The Association in the Presence of Linkage test (APL), a family-based association method, can use nuclear families with multiple affected siblings and infer missing parental genotypes properly in the linkage region. In Chapter 2, we generalized and extended APL so that it can be applied to general nuclear family structures using a bootstrap variance estimator. Unlike the original APL that can handle at most two affected siblings, the new APL can handle up to three affected siblings. We also extended APL from a single-marker test to a multiple-marker haplotype analysis. According to our simulations, the new APL has a correct type I error rate and more power than other family-based association methods such as PDT, FBAT/HBAT, and PDTPHASE in nuclear families with missing parents. The robustness of APL when there are rare alleles or haplotypes and when there is population substructure such that the allele frequencies in the population deviated from the Hardy-Weinberg Equilibrium (HWE) assumption was also examined in Chapter 2. Genes on the X chromosome play a role in many common diseases. Linkage analyses have identified regions on the X chromosome with high linkage peaks for several diseases. Currently there are few family-based association methods available for X-chromosome markers. In order to fill in this gap, we proposed a novel family-based association method, X-APL, in Chapter 3. X-APL is a modification of APL and shares some important properties with APL. 
X-APL can also perform haplotype analyses, making it the only family-based test of association we are aware of that tests haplotypes of X-chromosome markers. Our simulation results showed that X-APL has a correct type I error rate and has more power than other family-based association methods for the X chromosome, such as XS-TDT, XPDT and XMCPDT, for single-marker analysis in nuclear families. The robustness of X-APL when there are deviations of genotype frequencies from HWE was also examined in Chapter 3. Linkage and family-based association analyses are often applied simultaneously to the same data in order to maximize use of family data sets. However, it is not intuitively clear under what conditions association and linkage tests performed in the same data set may be correlated. In Chapter 4, we used computer simulations and theoretical statements to estimate the correlation between linkage statistics (affected sib pair maximum LOD scores) and family-based association statistics (PDT and APL) under various hypotheses. Different types of pedigrees were studied: nuclear families with affected sib pairs, extended pedigrees and incomplete pedigrees. Both simulation and theoretical results showed that when there is either no linkage or no association, the linkage and association statistics are not correlated. When there is linkage and association in the data, the two tests have a positive correlation.
Conners, Shannon Burns
01 December 2005
Carbohydrate utilization and production pathways identified in Thermotoga species likely contribute to their ubiquity in hydrothermal environments. Many carbohydrate-active enzymes from Thermotoga maritima have been characterized biochemically; however, sugar uptake systems and regulatory mechanisms that control them have not been well defined. Transcriptional data from cDNA microarrays were examined using mixed effects statistical models to predict candidate sugar substrates for ABC (ATP-binding cassette) transporters in T. maritima. Genes encoding proteins previously annotated as oligopeptide/dipeptide ABC transporters responded transcriptionally to various carbohydrates. This finding was consistent with protein sequence comparisons that revealed closer relationships to archaeal sugar transporters than to bacterial peptide transporters. In many cases, glycosyl hydrolases, co-localized with these transporters, also responded to the same sugars. Putative transcriptional repressors of the LacI, XylR, and DeoR families were likely involved in regulating genomic units for beta-1,4-glucan, beta-1,3-glucan, beta-1,4-mannan, ribose, and rhamnose metabolism and transport. Carbohydrate utilization pathways in T. maritima may be related to ecological interactions within cell communities. Exopolysaccharide-based biofilms composed primarily of beta-linked glucose, with small amounts of mannose and ribose, formed under certain conditions in both pure T. maritima cultures and mixed cultures of T. maritima and M. jannaschii. Further examination of transcriptional differences between biofilm-bound sessile cells and planktonic cells revealed differential expression of beta-glucan-specific degradation enzymes, even though maltose, an alpha-1,4 linked glucose disaccharide, was used as a growth substrate.
Higher transcript levels of genes encoding iron and sulfur compound transport, iron-sulfur cluster chaperones, and iron-sulfur cluster proteins suggest altered redox environments in biofilm cells. Further direct comparisons between cellobiose and maltose-grown cells suggested that transcription of cellobiose utilization genes is highly sensitive to the presence of cellobiose, or a cellobiose-maltose mixture. Increased transcript levels of genes related to polysulfide reductases in cellobiose-grown cells and biofilm cells suggested that T. maritima cells in pure culture biofilms escape hydrogen inhibition by preferentially reducing sulfur compounds, while cells in mixed culture biofilms form close associations with hydrogen-utilizing methanogens. In addition to probing issues related to the microbial physiology and ecology of T. maritima, this work illustrates the strategic use of DNA microarray-based transcriptional analysis for functional genomics studies.
Towards a Toxico-Chemogenomic Future: The Transformation of Public Gene Expression Data and Consideration for its Use.
Williams-DeVane, ClarLynda Raynell
16 December 2008
The term "toxico-chemogenomics" is used to convey extension of toxicogenomics to more broadly survey gene expression changes across chemical space. Moving towards an improved, publicly available toxico-chemogenomics capability requires not only common data standards and protocols across public resources, but also broad data coverage within the chemical, genomics and toxicological information domains, and transparent and functional linkages of Internet data resources. The first goal of this project was to assess the current extent of standardization, interoperability, and chemical indexing of public genomics resources with respect to toxico-chemogenomics utility. Focusing on the largest of these public data resources, Gene Expression Omnibus (GEO) and ArrayExpress, the second goal was to chemically index the full experimental content of these repositories to assess the current coverage of chemical exposure-related microarray experiments in relation to chemical space and toxicology, and to make these data accessible in relation to other publicly available, chemically-indexed toxicological information. Standards for chemical annotation within ArrayExpress and GEO are presently inadequate to this task, such that development of new methodologies to mine the author-submitted content was required. A series of automated Perl programs were utilized along with extensive manual review to transform the raw experiment/study descriptions and text files into a standardized chemically-indexed inventory of microarray experiments in both resources. These files and top-level experiment annotations allowed for identification of all current chemical-associated experimental content as well as the subset of chemical exposure-related (or "Treatment") content deemed most relevant to toxicogenomics in the GEO Series and ArrayExpress Repository experiment inventories.
With chemical exposure experiments suitably indexed by chemical structure, it is possible for the first time to assess the breadth of chemical study space represented in these databases, as well as the overlapping chemical content, and to begin to assess the sufficiency of data for making chemical similarity inferences. Chemical indexing of public genomics databases is also the first step towards integrating chemical, toxicological and genomics data into predictive toxicology by providing linkages across public resources. The main products of this effort include the following: (1) published, downloadable and structure-searchable DSSTox Structure-Index (Locator) files for both the GEO Series (GEOGDS) and ArrayExpress Repository (ARYEXP), containing standard chemical fields for the unique chemical "Treatment" subset, accompanied by URLs to AccessionID experiment pages in GEO and ArrayExpress; (2) published, downloadable DSSTox Aux data files for GEOGDS and ARYEXP providing a chemical-experiment pair index to all chemical-associated content in each resource and containing 14 standard genomics fields (e.g., Experiment_Title, Experiment_Description, Experiment_ArrayType, Species, Number_Samples, etc.) and source-specific fields extracted from each resource (e.g., MIAME_Protocol, MIAMI_Factors, etc. for ArrayExpress); and (3) incorporation of the "Treatment" chemical-experiment pair index with URLs linked directly to AccessionID pages for GEO and ArrayExpress into the National Center for Biotechnology Information (NCBI) PubChem resource. The secondary product of this effort is a methodology discussion about the proper use of public microarray data with a demonstrative analysis of how one might use the newly identified public microarray data.
Thesis (Ph.D.) -- University of Texas at Arlington, 2008.