Large-scale genomics projects such as the Cancer Genome Atlas (TCGA) and the Encyclopedia of DNA Elements (ENCODE) generate data at an unprecedented scale, requiring new computational techniques for analysis and interpretation. In the three studies I present in this thesis, I use these data sources to derive biological insights or to create visualization tools that enable others to obtain insights more easily. First, I examine the distribution of lengths of copy number variations (CNVs) in the cancer genome. This analysis shows that a small number of genes are altered at a greater frequency than expected from a power law distribution, suggesting that a large number of genomes must be sequenced for a given tumor type to achieve comprehensive discovery of somatic mutations. Second, I investigate germline CNVs in thousands of TCGA samples using single nucleotide polymorphism (SNP) array data to find variants that may confer increased susceptibility to cancer. This CNV-based genome-wide association study identified many germline CNVs that potentially increase risk in brain, breast, colorectal, renal, or ovarian cancers. Finally, I apply several visualization techniques to create tools for the TCGA and ENCODE projects that help investigators better process and synthesize meaning from large volumes of data. Seqeyes combines linear and circular genomic views to explore predicted structural variations and help guide experimental validation. The modEncode browser visualizes chromatin organization by integrating data from a multitude of histone marks and chromosomal proteins. These results present visualization as a useful strategy for rapid identification of salient genomic features from large, heterogeneous genomic datasets.
Identifier | oai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/15066 |
Date | 22 January 2016 |
Creators | Park, Richard Won |
Source Sets | Boston University |
Language | en_US |
Detected Language | English |
Type | Thesis/Dissertation |