Computational genomics lies at the intersection of computational applied physics, mathematics, statistics, computer science and biology. With advances in sequencing technology, large amounts of comprehensive genomic data are generated every year. However, genomic data are messy, complex and unstructured, which makes them extremely challenging to explore, analyze and understand with traditional methods. There is an urgent need for new quantitative methods to analyze large-scale genomics datasets. By collecting, processing and organizing clean genomics datasets, and by using these datasets to extract insights and relevant information, we are able to develop novel methods and strategies that address specific genetics questions using the tools of applied mathematics, statistics, and human genetics.
This thesis describes genetic and bioinformatics studies that use and develop state-of-the-art computational methods and strategies to identify and interpret de novo mutations that are likely to cause developmental disorders. We performed whole exome sequencing and whole genome sequencing on congenital diaphragmatic hernia parent-child trios and identified a new candidate risk gene, MYRF. Additionally, we found that male and female patients carry a different burden of likely-gene-disrupting mutations, and that the genes carrying likely-gene-disrupting mutations in isolated and complex patients differ in their expression levels in early development of diaphragm tissues.
To increase the power to detect risk genes and risk variants, we developed a deep neural network classifier called MVP to accurately predict the pathogenicity of missense variants. MVP uses a ResNet-based architecture, and on two independent data sets it achieved clearly better results in prioritizing pathogenic variants than other methods. Additionally, we studied the genetic connection between developmental disorders and cancer. We found that in developmental disorder patients, predicted deleterious de novo mutations are more enriched in cancer driver genes than in non-cancer-driver genes. A Hidden Markov Model was implemented to discover cancer somatic missense mutation hotspots, and we demonstrated that many cancer driver genes share a similar mode of action in developmental disorders and cancer. By improving the ability to interpret missense mutations and leveraging cancer genomics data, we can improve risk gene inference in developmental disorders.
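As a rough illustration of how a Hidden Markov Model can flag mutation hotspots along a protein, the sketch below decodes per-residue mutation counts with the Viterbi algorithm. The abstract does not specify the thesis's model, so everything here is an assumption made for illustration: the two states (background and hotspot), the Poisson emissions, the rates `bg_rate` and `hot_rate`, the transition probability `p_stay`, and the toy `counts` data are all hypothetical.

```python
import math

def poisson_logpmf(k, lam):
    """Log-probability of observing k events under a Poisson(lam) model."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def viterbi_hotspots(counts, bg_rate=0.2, hot_rate=3.0, p_stay=0.95):
    """Return one 0/1 label per position (1 = hotspot) via Viterbi decoding.

    Hypothetical two-state model: state 0 emits Poisson(bg_rate) counts,
    state 1 emits Poisson(hot_rate) counts, and each state persists with
    probability p_stay.
    """
    rates = (bg_rate, hot_rate)
    log_stay, log_switch = math.log(p_stay), math.log(1 - p_stay)
    # score[s] = best log-probability of any path ending in state s
    score = [math.log(0.5) + poisson_logpmf(counts[0], r) for r in rates]
    backptrs = []
    for k in counts[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[p] + (log_stay if p == s else log_switch)
                     for p in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + poisson_logpmf(k, rates[s]))
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state to recover the state path.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy per-residue somatic missense counts (made up for this example).
counts = [0, 0, 1, 0, 0, 4, 6, 5, 0, 0, 0, 1, 0]
labels = viterbi_hotspots(counts)
print(labels)
```

In this toy setting the cluster of high counts is labeled as a hotspot while isolated single mutations are not, because leaving the background state carries a transition penalty that a lone count cannot offset.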
Identifier | oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/D8N02QDR |
Date | January 2018 |
Creators | Qi, Hongjian |
Source Sets | Columbia University |
Language | English |
Detected Language | English |
Type | Theses |