The recent development of next-generation RNA sequencing, termed ‘RNA-Seq’, has offered an opportunity to explore RNA transcripts across the whole transcriptome. As a revolutionary method, RNA-Seq can not only measure transcript abundance precisely but also discover novel transcripts and uncover unknown regulatory mechanisms.
Meanwhile, combining next-generation sequencing data from different levels, such as genome sequencing and methylome sequencing, provides a powerful tool for new biological discoveries.

My PhD study focuses on the analysis of next-generation sequencing (NGS) data, especially RNA-Seq data. It comprises three parts: analysis tool development, data analysis and mechanistic study.

The analysis of massive NGS data is a great challenge. Many existing general-purpose aligners (in contrast to splice-aware aligners) can map tens of millions of short sequencing reads onto a reference genome very quickly. However, they are designed neither for reads that span splice junctions (spliced reads) nor for reads that match multiple locations in the reference genome (multireads). Hence, we have developed ABMapper, an ab initio mapping method that uses a two-seed strategy. Benchmark results show that ABMapper achieves higher accuracy and recall than comparable tools, TopHat and SpliceMap. A second problem is selecting the most probable location for spliced reads and multireads, which can align to several places in the genome. When expression levels are calculated, such reads are usually either assigned at random to one of the possible locations or discarded completely, which biases downstream analyses such as differential expression analysis and alternative splicing analysis. To determine the location of spliced reads and multireads rationally, we have proposed a maximum likelihood estimation method based on a geometric-tail (GT) empirical distribution of intron length. This probabilistic model handles splice junctions that fall between the two reads of a paired-end (PE) pair as well as those contained within one or both reads. Based on this model, multiple alignments of reads within a PE pair can be properly resolved.

The accumulation of NGS data has provided rich resources for deeper study of biological significance.
We have integrated RNA-Seq data and methylation sequencing data to build a predictive model of gene expression level based on DNA methylation patterns. We found that DNA methylation predicts gene expression fairly accurately, with accuracy reaching up to 78%. We also found that DNA methylation in the gene body is the most important region in these models, more informative than methylation near the promoter. Finally, a feature overlap network, built on an optimal subset chosen by feature selection from the combined set of all methylation patterns and CpG patterns, indicates collaborative regulation of gene expression by DNA methylation patterns.

In addition to developing new algorithms for RNA-Seq data analysis, we performed transcriptome analysis on zebrafish. The analysis of differentially expressed genes and pathways involved after calycosin treatment, combined with other experimental evidence such as fluorescence microscopy and quantitative real-time polymerase chain reaction (qPCR), demonstrates the proangiogenic effects of calycosin in vivo.

In summary, this thesis details my work on NGS data analysis, the discovery of biological significance using data-mining algorithms, and transcriptome analysis.

Detailed summary in vernacular field only.
Lou, Shaoke.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2012.
Includes bibliographical references (leaves 135-146).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstract also in Chinese.
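The abstract's idea of choosing a read location by maximum likelihood under a geometric-tail (GT) distribution of intron length can be illustrated with a small sketch. This is not the thesis implementation: the function names `fit_geometric_tail` and `most_likely_location`, the `threshold` parameter, the exact form of the distribution (empirical frequencies below the threshold, a geometric decay at and above it), and the toy intron lengths are all assumptions made for illustration only.

```python
import math
from collections import Counter


def fit_geometric_tail(intron_lengths, threshold):
    """Fit a hypothetical geometric-tail model to observed intron lengths.

    Assumed form: lengths below `threshold` use their empirical frequency;
    lengths at or above it follow a geometric decay that carries the
    remaining probability mass.
    """
    n = len(intron_lengths)
    body = Counter(x for x in intron_lengths if x < threshold)
    tail = [x for x in intron_lengths if x >= threshold]
    tail_mass = len(tail) / n
    if tail:
        # MLE of the geometric parameter for the excess over the threshold.
        mean_excess = sum(x - threshold for x in tail) / len(tail)
        p = 1.0 / (1.0 + mean_excess)
    else:
        p = 1.0

    def log_prob(length):
        if length < threshold:
            prob = body.get(length, 0) / n
        else:
            prob = tail_mass * p * (1.0 - p) ** (length - threshold)
        # Unseen short lengths get zero probability in this simple sketch.
        return math.log(prob) if prob > 0 else float("-inf")

    return log_prob


def most_likely_location(candidates, log_prob):
    """Pick the candidate (label, implied_intron_length) with the highest
    log-likelihood under the fitted model."""
    return max(candidates, key=lambda c: log_prob(c[1]))


# Toy data (invented): intron lengths from uniquely mapped spliced reads.
observed = [80, 80, 90, 90, 90, 120, 500, 700, 1500, 4000]
log_prob = fit_geometric_tail(observed, threshold=200)

# Two hypothetical alignments of one read pair, each implying an intron length.
candidates = [("chr1:1000", 90), ("chr1:250000", 60000)]
print(most_likely_location(candidates, log_prob))
```

In this sketch a candidate that implies a very long intron is penalized by the geometric tail, so the alignment with the more typical intron length is preferred. A practical version would smooth the empirical part so that unseen short lengths do not receive zero probability.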
Abstract (in Chinese)
Acknowledgement
Chapter 1 --- Introduction
1.1 --- Bioinformatics
1.2 --- Bioinformatics application
1.3 --- Motivation
1.4 --- Objectives
1.5 --- Thesis outline
Chapter 2 --- Background
2.1 --- Biological and biotechnology background
2.1.1 --- Central dogma and biology ABC
2.1.2 --- Transcription
2.1.3 --- Splicing and Alternative Splicing
2.1.4 --- Next-generation Sequencing
2.1.5 --- RNA-Seq
2.2 --- Computational background
2.2.1 --- Approximate string matching and read mapping
2.2.2 --- Read mapping algorithms and tools
2.2.3 --- Spliced alignment tools
Chapter 3 --- ABMapper: a two-seed based spliced alignment tool
3.1 --- Introduction
3.2 --- State-of-the-art
3.3 --- Problem formulation
3.4 --- Methods
3.5 --- Results
3.5.1 --- Benchmark test
3.5.2 --- Complexity analysis
3.5.3 --- Comparison with other tools
3.6 --- Discussion and conclusion
Chapter 4 --- Geometric-tail (GT) model for rational selection of RNA-Seq read location
4.1 --- Introduction
4.2 --- State-of-the-art
4.3 --- Problem formulation
4.4 --- Algorithms
4.5 --- Results
4.5.1 --- Workflow of GT MLE method
4.5.2 --- GT distribution and insert-size distribution
4.5.3 --- Multiread analysis
4.5.4 --- Splice-site comparison
4.6 --- Discussion and conclusion
Chapter 5 --- Explore relationship between methylation patterns and gene expression
5.1 --- Introduction
5.2 --- State-of-the-art
5.3 --- Problem formulation
5.4 --- Methods
5.4.1 --- NGS sequencing and analysis
5.4.2 --- Data preparation and transformation
5.4.3 --- Random forest (RF) classification and regression
5.5 --- Results
5.5.1 --- Genome wide profiling of methylation
5.5.2 --- Aggregation plot of methylation levels at different regions
5.5.3 --- Scatterplot between methylation and gene expression
5.5.4 --- Predictive model of gene expression using DNA methylation features
5.5.5 --- Comb-model based on the full dataset
5.6 --- Discussion and conclusion
Chapter 6 --- RNA-Seq data analysis and applications
6.1 --- Transcriptional Profiling of Angiogenesis Activities of Calycosin in Zebrafish
6.1.1 --- Introduction
6.1.2 --- Background
6.1.3 --- Materials and methods and ethics statement
6.1.4 --- Results
6.1.5 --- Conclusion
6.2 --- An integrated web medicinal materials DNA database: MMDBD (Medicinal Materials DNA Barcode Database)
6.2.1 --- Introduction
6.2.2 --- Background
6.2.3 --- Construction and content
6.2.4 --- Utility and discussion
6.2.5 --- Conclusion and future development
Chapter 7 --- Conclusion
7.1 --- Conclusion
7.2 --- Future work
Appendix
A1 --- Descriptive analysis of trio data
A2 --- Whole genome methylation level profiling
A3 --- Global sliding window correlation between individuals
A4 --- Features selected after second-run filtering
Bibliography
A --- Publications
Reference
Identifier | oai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_328197 |
Date | January 2012 |
Contributors | Lou, Shaoke., Chinese University of Hong Kong Graduate School. Division of Computer Science and Engineering. |
Source Sets | The Chinese University of Hong Kong |
Language | English, Chinese |
Detected Language | English |
Type | Text, bibliography |
Format | electronic resource, remote, 1 online resource (ix, 146 leaves) : ill. (chiefly col.) |
Rights | Use of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/) |