361 |
Application of Graph Theoretic Clustering on Some Biomedical Data Sets
Ahlert, Darla, 11 June 2015
<p> Clustering algorithms have become a popular way to analyze biomedical data sets, in particular gene expression data. Since these data sets are often large, it is difficult to gather useful information from them as a whole. Clustering is a proven method to extract knowledge about the data that can eventually lead to many discoveries in the biological world. Hierarchical clustering is used frequently to interpret gene expression data, but recently, graph-theoretic clustering algorithms have started to gain traction for analysis of this type of data. We consider five graph-theoretic clustering algorithms run over a post-mortem gene expression dataset, as well as a few different biomedical data sets, in which the ground truth, or class label, is known for each data point. We then externally evaluate the algorithms based on the accuracy of the resulting clusters against the ground truth clusters. Comparing the results of each of the algorithms run over all of the datasets, we found that our algorithms are efficient on the real biomedical datasets but find gene expression data especially difficult to handle.</p>
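The external evaluation described above can be illustrated with a small sketch. The abstract does not say which accuracy measure was used, so cluster purity is assumed here purely as one common choice; the function name `purity` and the toy labels are hypothetical.

```python
from collections import Counter


def purity(predicted, truth):
    """Fraction of points that fall in the majority ground-truth class of their cluster.

    Purity is an assumed stand-in for the (unspecified) accuracy measure
    in the abstract; it is not claimed to be the author's metric.
    """
    if len(predicted) != len(truth):
        raise ValueError("label lists must have the same length")
    if not predicted:
        raise ValueError("label lists must not be empty")
    # Collect the ground-truth labels seen inside each predicted cluster.
    by_cluster = {}
    for cluster_id, label in zip(predicted, truth):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    # Each cluster contributes the size of its largest ground-truth class.
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return matched / len(truth)
```

A clustering that reproduces the class labels exactly scores 1.0, while lumping two equally sized classes into one cluster scores 0.5.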
|
362 |
Automated Prediction of Human Disease Genes
Blom, Martin, 21 February 2013
The completion of the human genome project has led to a flood of new genetic data that has proved surprisingly hard to interpret. Network "guilt by association" (GBA) is a proven approach for identifying novel disease genes based on the observation that similar mutational phenotypes arise from functionally related genes.
However, GBA has been shown to work poorly in genome-wide association studies (GWAS), where many genes are somewhat implicated, but few are known with very high certainty. In the first part of this work, I resolve this by explicitly modeling the uncertainty of the associations and incorporating the uncertainty for the seed set into the GBA framework. I demonstrate a significant boost in the power to detect validated candidate genes for Crohn’s disease and type 2 diabetes by comparing the predictions from my method to results from follow-up meta-analyses, with incorporation of the network serving to highlight the JAK-STAT pathway and associated adaptors GRB2/SHC1 in Crohn’s disease and BACH2 in type 2 diabetes. Consideration of the network during GWAS thus conveys some of the benefits of enrolling more participants in the study. More generally, I demonstrate that a functional network of human genes provides a valuable statistical framework for prioritizing candidate disease genes in GWAS-based studies.
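One way to picture guilt by association with an uncertain seed set is the sketch below. It is an illustration under stated assumptions, not the method of the thesis: the network is assumed to be a dictionary of weighted gene pairs, each seed gene carries a hypothetical probability of being truly associated, and a candidate's score is the probability-weighted sum of its links to seeds. The function name `gba_scores` and the scoring rule are invented for this example.

```python
def gba_scores(network, seed_prob):
    """Score non-seed genes by their probability-weighted links to seed genes.

    network: dict mapping (gene_a, gene_b) tuples to an edge weight
             (assumed undirected and non-negative).
    seed_prob: dict mapping seed gene -> assumed probability that the
               gene is truly associated with the disease.
    """
    scores = {}
    for (a, b), weight in network.items():
        # Treat each edge as undirected: look at it from both endpoints.
        for gene, neighbor in ((a, b), (b, a)):
            if neighbor in seed_prob and gene not in seed_prob:
                scores[gene] = scores.get(gene, 0.0) + weight * seed_prob[neighbor]
    return scores
```

With hard (0/1) seed labels this reduces to the usual sum of edge weights to known disease genes; fractional probabilities let weakly implicated GWAS genes contribute proportionally less.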
Furthermore, functional gene networks are not the only kind of information that can be used to predict gene-phenotype associations. In the second part of this thesis, I show that gene-phenotype associations in model species, including species as distantly related to humans as E. coli, are another valuable source of information that can be mined using methods similar to those used in recommender systems.
Finally, in the last part of this thesis, I present a machine learning formalism that combines the functional gene network and model species phenotype information. Using cross-validation, I show that this approach outperforms the state-of-the-art methods for gene-phenotype association prediction. / text
|
363 |
Aligning multiple sequences adaptively
Ye, Yongtao, 叶永滔, January 2014
With the rapid development of genome sequencing, an ever-increasing number of molecular biology analyses, such as motif detection, phylogeny inference and structure prediction, rely on the construction of an accurate multiple sequence alignment (MSA). Although many methods have been developed during the last two decades, most of them may perform poorly on some types of inputs, in particular when families of sequences fall below thirty percent similarity. Therefore, this thesis introduces two different effective approaches to improve the overall quality of multiple sequence alignment.
First, by considering the similarity of the input sequences, we propose an adaptive approach that computes better substitution matrices for each pair of sequences and then applies the progressive alignment method to align them. For example, for inputs with high similarity, we consider the whole sequences and align them with a global pair-Hidden Markov model, while for those with moderately low similarity, we may ignore the flank regions and use local pair-Hidden Markov models to align them. To test the effectiveness of this approach, we have implemented a multiple sequence alignment tool called GLProbs and compared its performance with a dozen leading tools on three benchmark alignment databases; GLProbs' alignments have the best scores in almost all tests. We have also evaluated the practicability of the alignments of GLProbs by applying the tool to three biological applications, namely phylogenetic tree reconstruction, protein secondary structure prediction and the detection of high-risk members for cervical cancer in the HPV-E6 family, and the results are very encouraging.
Second, based on our previous study, we proposed another new tool PnpProbs, which constructs better multiple sequence alignments by better handling of guide trees. It classifies input sequences into two types: normally related sequences and distantly related sequences. For normally related sequences, it uses an adaptive approach to construct the guide tree, and based on this guide tree, aligns the sequences progressively. To be more precise, it first estimates the input's discrepancy by computing the standard deviation of their percent identities, and based on this estimate, it chooses the best method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree; instead it uses the non-progressive sequence annealing method to construct the multiple sequence alignment. By combining the strength of the progressive and non-progressive methods, and with a better way to construct the guide tree, PnpProbs improves the quality of multiple sequence alignments significantly for not only general input sequences, but also those very distantly related.
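The discrepancy estimate described above (the standard deviation of the pairwise percent identities) can be sketched as follows. The abstract does not name the candidate guide-tree methods or the cut-off, so the threshold value and the two method labels below are placeholders, and the use of the population standard deviation is an assumption.

```python
from statistics import pstdev


def choose_guide_tree_method(percent_identities, threshold=10.0):
    """Return (spread, method label) for a list of pairwise percent identities.

    threshold and both method labels are hypothetical placeholders;
    they are not taken from PnpProbs.
    """
    if not percent_identities:
        raise ValueError("need at least one pairwise percent identity")
    # Population standard deviation of the pairwise identities (assumed variant).
    spread = pstdev(percent_identities)
    if spread < threshold:
        method = "method_for_uniform_input"
    else:
        method = "method_for_diverse_input"
    return spread, method
```

For instance, a family whose pairwise identities are all 50% has zero spread and takes the first branch, whereas identities of 10% and 90% give a spread of 40 and take the second.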
Following these encouraging empirical results, our software tools have gradually gained recognition in the community. For example, GLProbs has been incorporated, by invitation, into the JAva Bioinformatics Analysis Web Services system (JABAWS). / published_or_final_version / Computer Science / Master / Master of Philosophy
|
364 |
The role of read depth in the design and analysis of sequencing experiments
Robinson, David Garrett, 04 September 2015
<p> The development of quantitative sequencing technologies, such as RNA-Seq, Bar-Seq, ChIP-Seq, and metagenomics, has offered great insight into molecular biology. Proper design and analysis of these experiments require statistical models and techniques that consider the specific nature of sequencing data, which typically consists of a matrix of read counts per feature. An issue of particular importance to the development of these methods is the role of read depth in statistical accuracy and power. The depth of an experiment affects the power to make biological conclusions, meaning an experiment design must consider the tradeoff between cost, power, and the number of samples that are examined. Similarly, per-gene read depth affects each gene's power and accuracy, and must be taken into account in any downstream analysis. </p><p> Here I explore many facets of the role of read depth in the design and analysis of sequencing experiments, and offer computational and statistical methods for addressing them. To assist in the design of sequencing experiments, I present subSeq, which examines the effect of depth in an experiment by subsampling reads to simulate lower depths. I use this method to examine the extent of read saturation across a variety of RNA-Seq experiments, and demonstrate a statistical model for predicting the effect of increasing depth in any experiment. I consider intensity-dependence in a technology comparison between microarrays and RNA-Seq, and show that the variance added by RNA-Seq depends more on depth than the variance in microarray depends on fluorescence intensity. I demonstrate that Bar-Seq data shares these depth-dependent properties with RNA-Seq and can be analyzed by the same tools, and further provide suggestions on the appropriate depth for Bar-Seq experiments. 
Finally, I show that per-gene read depth can be taken into account in multiple hypothesis testing to improve power, and introduce the method of functional false discovery rate (fFDR) control.</p>
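The idea of subsampling reads to simulate a lower sequencing depth can be sketched with binomial thinning of a count matrix. This is an illustration only and makes no claim about how subSeq itself is implemented; the function name `subsample_counts`, its parameters, and the fixed default seed are assumptions.

```python
import random


def subsample_counts(counts, proportion, seed=0):
    """Keep each read independently with probability `proportion`.

    counts: list of rows, each a list of non-negative integer read counts.
    Returns a new matrix of the same shape with thinned counts.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [
        # Each of the `c` reads survives with probability `proportion`.
        [sum(rng.random() < proportion for _ in range(c)) for c in row]
        for row in counts
    ]
```

Running an analysis on matrices thinned at several proportions (for example 0.1, 0.5 and 1.0) shows how the number of significant features changes with depth.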
|
365 |
The identification and study of known and novel variants in spinocerebellar ataxia
Reed, Patrick Jennings, 16 September 2015
<p> The study of how genotypes encode phenotypes is germane if not central to every area of genetics research. This thesis focuses on the application of targeted and Whole Exome Sequencing (WES) in the discovery, identification and understanding of Mendelian disorders primarily with Spinocerebellar Ataxia(s) (SCA) phenotypes. Analogous to Mendelian disorders as a whole, SCA are a complex group of disorders with both shared and unique clinical symptoms. Currently, 28 genetically unique forms of autosomal dominant SCA have been identified. This thesis begins by exploring the potential role of targeted Next Generation Sequencing (NGS) as a clinical and diagnostic tool. The benefits of using targeted sequencing in a clinical setting are two-fold. First, it presents the opportunity to rapidly screen symptomatic individuals for all known genetic variants associated with ataxia phenotypes, thereby greatly increasing the likelihood and accuracy of a diagnosis. Second, symptomatic patients who test negative for all known variants may harbor a novel genetic variant. The identification of novel forms of SCA is of great importance both clinically and in basic research. As a case in point, this thesis uses WES to identify the genetic basis of a rare autosomal dominant SCA affecting a multi-generational kindred of Canadian European descent. Four affected and two unaffected individuals spanning three generations have been sequenced to identify the causal variant. The ability of WES to identify the pathogenic variant of a Mendelian disorder from only four affected individuals is a significant benchmark in genomics research. It is both feasible and probable that as reference databases become more extensive, the genetic diagnosis of human disease from a single affected individual will be common practice. This thesis concludes by examining Mendelian disease variants from a broader perspective. 
Large exome variant cohorts of asymptomatic individuals are examined for the presence of known pathogenic Mendelian variants. The presence of such variants in a reference database is empirical evidence that pathogenic variants, while necessary, are not sufficient to cause many Mendelian disorders. Specifically, we demonstrate that variable penetrance and expressivity are pervasive factors in Mendelian genetics that have yet to be fully appreciated.</p>
|
366 |
A Motif Discovery and Analysis Pipeline for Heterogeneous Next-Generation Sequencing Data
Ramsay, Trevor, 10 October 2015
<p> Bioinformatics has made great strides in understanding the regulation of gene expression, but many of the tools developed for this purpose depend on data from a limited number of species. Undomesticated trees have unique genetic attributes, yet there remains a dearth of research into them. The poplar tree, <i>Populus trichocarpa</i>, has undergone multiple rounds of genome duplication during its evolution. In addition, its life cycle differs from that of the annual crop and model plants previously studied, which creates significant technical challenges in understanding the unique biology of these trees. For example, the process of secondary growth occurs as the tree stems thicken, and creates secondary xylem (wood) and phloem (inner bark) for transport of water and products of photosynthesis, respectively. Because of this, the research group I work with studies the secondary growth of <i>P. trichocarpa</i> (Spicer, 2010) (Groover, et al., 2010) (Groover, et al., 2006) (Groover, 2005).</p><p> The genomic tools to investigate gene regulation in <i>P. trichocarpa</i> are readily available. Next-generation sequencing technologies such as RNA-Seq and ChIP-Seq can be used to understand gene expression and binding of transcription factors to specific locations in the genome. Similarly, a variety of specialized bioinformatic tools such as edgeR, Cufflinks, and MACS can be used to analyze gene binding and expression from sequencing data provided by ChIP-Seq and RNA-Seq (Blahnik, et al., 2010) (Mortazavi, et al., 2008) (Robinson, 2010) (Robinson, 2007) (Robinson, et al., 2008) (McCarthy, 2012) (Trapnell, 2013) (Zhang, 2008). The binding and expression data these tools provide form a foundation for analyzing the gene expression regulation in <i>P. trichocarpa</i>.</p><p> The goal of my project is to provide a motif discovery and analysis pipeline for analyses of <i>Populus</i> species. 
The motif discovery and analysis pipeline utilizes heterogeneous data collected from poplar and aspen mutants to elucidate the gene regulatory mechanisms involved in secondary growth. The experiments target transcription factors related to secondary growth, and through analysis of the variety of transcription factor binding experiments, I have identified the motifs involved in gene regulation of secondary growth within <i>P. trichocarpa.</i> (Filkov, et al., 2008).</p>
|
367 |
Exploring the Use of Electronic Health Record-Linked Biorepositories for Pharmacogenomic Application and Discovery
Gonzaludo, Nina, 14 October 2015
<p> Drug response is well documented to vary considerably among patient groups and populations, as well as within individual patients. Since drug prescribing is often based on population averages of drug response, many patients will not respond, and up to one-third may experience harmful toxicity. Genetics plays a large role in explaining the variability observed in response to different drugs and is an important factor driving precision medicine initiatives. Pharmacogenetic information can be useful in optimizing patient therapy, potentially reducing the cost of hospitalizations and treatment of adverse drug events. </p><p> As part of the Kaiser Permanente Research Program on Genes, Environment, and Health (RPGEH), we analyzed 102,979 members of the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort with genetic information available, along with almost two decades of electronic health record (EHR) data, prescription records, and lifestyle survey results. In one of the largest, most ethnically diverse pharmacogene characterization studies to date, we assessed cohort metabolizer status phenotypes for 7 drug-gene interactions (DGIs) for which there is moderate to strong evidence suggesting the use of pharmacogenetic information to guide therapy. Overall, 89% of the cohort had at least one actionable allele for the 7 DGIs in this study, and we observed large variations among ethnicities. Additionally, 17,747 individuals had been prescribed a drug for which they had an actionable or high-risk metabolizer status phenotype. For these individuals, the availability of pharmacogenetic information at point-of-care may have potentially led to a more personalized drug or dosing regimen. </p><p> Following this study, we assessed the utility of this resource for deriving two drug response phenotypes: weight gain induced by atypical antipsychotic use and major adverse cardiovascular events in clopidogrel non-responders. 
Despite challenges in deriving phenotypes from the EHR, we were able to extract phenotypes that reflected observed estimates from previously published studies. Using these phenotypes, we performed candidate gene and genome-wide association studies to identify genetic variants associated with response. Altogether, this dissertation demonstrates the potential utility and clinical impact of integrating genetic data with EHRs for pharmacogenetic application and discovery, and provides the foundation for future studies in precision medicine.</p>
|
368 |
Identification of Dermacentor andersoni saliva proteins that modulate mammalian phagocyte function
Mudenda, Lwiindi, 13 August 2015
<p> Ticks are obligate blood sucking parasites which transmit a wide range of pathogens worldwide including protozoa, bacteria and viruses. Additionally, tick feeding alone may result in anemia, dermatosis and toxin-induced paralysis. <i> Dermacentor andersoni</i> is a species of tick found in the western United States that transmits pathogens of public health importance including <i> Rickettsia rickettsii, Francisella tularensis,</i> and Colorado Tick Fever Virus, as well as <i>Anaplasma marginale</i>, a rickettsial pathogen that causes economic losses in both the dairy and beef industries worldwide. <i>D. andersoni</i> ticks are obligate blood sucking parasites that require a blood meal through all stages of their lifecycle. During feeding, ticks secrete factors that modulate both innate and acquired immune responses in the host which enables them to feed for several days without detection. The pathogens transmitted by ticks exploit these immunomodulatory properties to facilitate invasion of and replication in the host. Molecular characterization of these immunomodulatory proteins secreted in tick saliva offers an opportunity to develop novel anti-tick vaccines as well as anti-inflammatory drug targets. To this end we performed deep sequence analysis on unfed ticks and ticks fed for 2 or 5 days. The pooled data generated a database of 21,797 consensus sequences. Salivary gland gene expression levels of unfed ticks were compared to 2- and 5-day fed ticks to identify genes upregulated early during tick feeding. Next we performed mass spectrometry on saliva from 2- and 5-day fed ticks and used the database to identify 677 proteins. We cross referenced the protein data with the transcriptome data to identify 157 proteins of interest for immunomodulation and blood feeding. Both proteins of unknown function and known immunomodulators were identified. 
We expressed four of these proteins and tested them for inhibition of macrophage activation and/or cytokine expression in vitro. The results showed diverse effects of the various test proteins on the inflammatory response of mouse macrophage cell lines. The proteins upregulated some cytokines while downregulating others. However, all the proteins upregulated the regulatory cytokine IL-10.</p>
|
369 |
The Roles of RNA-binding Proteins in the Developing Nervous System
Quan, Jie, January 2013
RNA-binding proteins are key players in post-transcriptional regulation of gene expression by orchestrating RNA fate from synthesis to decay. Hundreds of proteins with RNA-binding capacity have been identified so far, yet only a small fraction has been functionally characterized and presumably many more RNA-binding proteins await discovery. The roles of RNA-binding proteins in the nervous system are of particular interest because accumulating evidence has linked RNA-based mechanisms to neural development, maintenance and repair. Here, the three RNA-binding proteins under study are IGF-II mRNA binding proteins IMP-1 and IMP-2, known to be involved in mRNA localization, translational control and stability, and adenomatous polyposis coli (APC), identified as a novel RNA-binding protein. To systematically identify their RNA binding profiles, a high-throughput approach combining protein-RNA crosslinking and immunoprecipitation with next-generation sequencing (HITS-CLIP) was applied in embryonic mouse brain. A nonparametric method was developed to computationally analyze the CLIP sequencing data, mapping transcriptome-wide protein-RNA interactions. The identified target mRNAs of IMP-1 and IMP-2 were highly enriched for functions related to neural development, especially neuron projection morphogenesis and axon guidance signaling. Moreover, these target mRNAs were associated with a variety of neurological diseases, including neurodevelopmental and neurodegenerative disorders. Supporting roles in axon development, knockdown of IMP-1 or IMP-2 caused aberrant trajectories of commissural axons in chicken spinal cord. APC mRNA targets were highly enriched for APC-related functions, including microtubule organization, cell and axon motility, Wnt signaling, cancer and neurological disease. Among the APC targets was Tubulin β-2B (Tubb2b), previously known to be required for neuronal migration. 
It was found that Tubb2b was synthesized in axons, and localized preferentially to dynamic microtubules in the peripheral domain of the growth cone. Blocking the APC binding site in the Tubb2b mRNA 3'UTR caused reduction in its expression in axons and loss of the growth cone peripheral area, and impaired cortical neuron migration in vivo. These findings offer an informative snapshot of the protein-RNA interactome, which can provide a basis to better understand the roles of RNA-binding proteins in the nervous system.
|
370 |
The Evolutionary Feedback between Genetic Conflict and Genome Architecture
Young, Adrian, 07 July 2015
The advent of separate sexes set the stage for dramatic evolutionary innovation across a wide range of taxa. Much of this innovation is attributable to divergent evolutionary interests between now distinct sub-populations of males and females. Trade-offs inherent to these divergent life histories, coupled with a common genome, conspire to limit natural selection's ability to simultaneously maximize the fitness of both sexes. Such conflict between the sexes has therefore largely shaped the history of the genomes of sexual taxa. However, various aspects of the genomic environment—including genes' spatial distributions, abilities to regulate their expression, and rates of recombination—also feed back to influence future sex-specific evolutionary trajectories. Using various genomic resources and transcriptome sequences for the lab mouse, I test several theoretical predictions regarding this feedback between genetic conflict and features of genomic organization.
|