Global ETD Search

1	A novel framework for expression quantitative trait loci mapping Ai, Ni., 艾妮. January 2011 / Electrical and Electronic Engineering / Master / Master of Philosophy Quantitative genetics. Gene expression - Statistical methods.
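2	Transcriptome analysis and applications based on next-generation RNA sequencing data. / CUHK electronic theses & dissertations collection January 2012 Second-generation cDNA sequencing, also known as "RNA-Seq", offers a new means of studying the transcriptome. As a revolutionary technique, RNA-Seq not only helps measure transcript expression levels accurately but can also discover novel transcripts and reveal mechanisms of transcriptional regulation. In addition, integrating sequencing data from several different levels, such as genome sequencing and methylome sequencing, provides a powerful tool for exploring biological significance in depth. / My doctoral research focuses on the analysis of next-generation sequencing (NGS) data, especially RNA-Seq data. It comprises three parts: development of analysis tools, data analysis, and mechanistic study. / Analyzing large volumes of sequencing data is a major challenge for next-generation sequencing. At present, in contrast to splice-aware aligners, general aligners can map tens of millions of short reads to the genome at ultrafast speed, but they have difficulty handling reads that span splice junctions (spliced reads) or that match multiple genomic locations (multireads). We developed ABMapper, a new alignment tool that uses a two-seed strategy. Benchmark results show that ABMapper has higher accuracy and recall than comparable tools, TopHat and SpliceMap. On the other hand, spliced reads and multireads have several matching positions on the genome, and choosing the most probable one becomes a major problem. When gene expression values are computed, multireads and spliced reads are often assigned at random to one of their candidate positions or excluded outright. This introduces bias and directly affects the accuracy of downstream analysis. To solve the location-selection problem for multireads and spliced reads, we propose a maximum likelihood estimation method based on an empirical geometric-tail (GT) distribution of intron length. This probabilistic model applies both when the splice junction lies within a read and when it lies between the two reads of a paired-end (PE) pair. Based on this model, we can better determine the most probable location of paired-end reads that have multiple matches on the genome. / The accumulation of sequencing data provides a rich resource for in-depth study of biological significance. Using RNA-Seq data and methylation sequencing data, we built a model that predicts gene expression levels from DNA methylation patterns. With this model we found that DNA methylation predicts gene expression levels fairly accurately, with an accuracy of 78%. We also found that DNA methylation over the gene body is more important than methylation near the promoter. Finally, from a combined dataset of all methylation patterns and CpG patterns, we used feature selection to choose an optimal subset. We built a feature overlap network from this optimal subset, which further reveals how DNA methylation patterns cooperatively regulate gene expression. / Besides developing tools for RNA-Seq data analysis and data mining, we also analyzed the zebrafish transcriptome. RNA-Seq data analysis, combined with biological experiments such as fluorescence imaging and quantitative PCR, revealed the pathways and differentially expressed genes involved after calycosin treatment, and the results also demonstrated the angiogenic activity of calycosin in vivo. / In summary, this thesis describes in detail my work on next-generation sequencing data analysis, data-mining-based discovery of biological significance, and transcriptome analysis. / The recent development of next-generation RNA sequencing, termed ‘RNA-Seq’, has offered an opportunity to explore the RNA transcripts of the whole transcriptome. 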
As a revolutionary method, RNA-Seq can not only measure transcript abundances precisely but also discover novel transcribed content and uncover unknown regulatory mechanisms. Meanwhile, combining different levels of next-generation sequencing, such as genome sequencing and methylome sequencing, has provided a powerful tool for novel discovery in a biological context. / My PhD study focuses on the analysis of next-generation sequencing data, especially RNA-Seq data. It comprises three parts: pipeline development, data analysis and mechanistic study. / For next-generation sequencing (NGS) technology, the analysis of massive NGS data is a great challenge. Many existing general aligners (in contrast to splice-aware alignment tools) are capable of mapping millions of sequencing reads onto a reference genome. However, they are designed neither for reads that span splice junctions (spliced reads) nor for reads that could match multiple locations along the reference genome (multireads). Hence, we have developed an ab initio mapping method, ABMapper, which uses a two-seed strategy. The benchmark results show that ABMapper achieves higher accuracy and recall than comparable tools, TopHat and SpliceMap. On the other hand, selecting the most probable location for spliced reads and multireads becomes a major problem. These reads are randomly assigned to one of the possible locations or discarded completely when calculating the expression level, which biases downstream analyses such as differential expression analysis and alternative splicing analysis. To determine the location of spliced reads and multireads rationally, we have proposed a maximum likelihood estimation method based on a geometric-tail (GT) distribution of intron length. This probabilistic model deals with splice junctions that lie between the reads of a paired-end (PE) pair or within one or both of those reads. 
Based on this model, multiple alignments of reads within a PE pair can be properly resolved. / The accumulation of NGS data has provided rich resources for deep discovery of biological significance. We have integrated RNA-Seq data and methylation sequencing data to build a predictive model for the regulation of gene expression based on DNA methylation patterns. We found that DNA methylation can predict gene expression fairly accurately, with an accuracy of up to 78%. We have also found that the gene body is the most important region in these models, with its DNA methylation even more useful than that of the promoter. Finally, a feature overlap network based on an optimal subset of the combination of all methylation patterns and CpG patterns has indicated the collaborative regulation of gene expression by DNA methylation patterns. / Not only were new algorithms developed to facilitate RNA-Seq data analysis, but transcriptome analysis was also performed on zebrafish. The analysis of differentially expressed genes and pathways involved after calycosin treatment, combined with other experimental evidence such as fluorescence microscopy and quantitative real-time polymerase chain reaction (qPCR), has demonstrated the proangiogenic effects of calycosin in vivo. / In summary, this thesis details my work on NGS data analysis, discovery of biological significance using data-mining algorithms and transcriptome analysis. / Detailed summary in vernacular field only. / Lou, Shaoke. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 135-146). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. 
/ Abstract in Chinese --- p.iii / Acknowledgement --- p.v / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Bioinformatics --- p.1 / Chapter 1.2 --- Bioinformatics application --- p.1 / Chapter 1.3 --- Motivation --- p.2 / Chapter 1.4 --- Objectives --- p.3 / Chapter 1.5 --- Thesis outline --- p.3 / Chapter 2 --- Background --- p.4 / Chapter 2.1 --- Biological and biotechnology background --- p.4 / Chapter 2.1.1 --- Central dogma and biology ABC --- p.4 / Chapter 2.1.2 --- Transcription --- p.5 / Chapter 2.1.3 --- Splicing and Alternative Splicing --- p.6 / Chapter 2.1.4 --- Next-generation Sequencing --- p.10 / Chapter 2.1.5 --- RNA-Seq --- p.18 / Chapter 2.2 --- Computational background --- p.20 / Chapter 2.2.1 --- Approximate string matching and read mapping --- p.21 / Chapter 2.2.2 --- Read mapping algorithms and tools --- p.22 / Chapter 2.2.3 --- Spliced alignment tools --- p.27 / Chapter 3 --- ABMapper: a two-seed based spliced alignment tool --- p.29 / Chapter 3.1 --- Introduction --- p.29 / Chapter 3.2 --- State-of-the-art --- p.30 / Chapter 3.3 --- Problem formulation --- p.31 / Chapter 3.4 --- Methods --- p.33 / Chapter 3.5 --- Results --- p.35 / Chapter 3.5.1 --- Benchmark test --- p.35 / Chapter 3.5.2 --- Complexity analysis --- p.39 / Chapter 3.5.3 --- Comparison with other tools --- p.39 / Chapter 3.6 --- Discussion and conclusion --- p.41 / Chapter 4 --- Geometric-tail (GT) model for rational selection of RNA-Seq read location --- p.42 / Chapter 4.1 --- Introduction --- p.42 / Chapter 4.2 --- State-of-the-art --- p.44 / Chapter 4.3 --- Problem formulation --- p.44 / Chapter 4.4 --- Algorithms --- p.45 / Chapter 4.5 --- Results --- p.49 / Chapter 4.5.1 --- Workflow of GT MLE method --- p.49 / Chapter 4.5.2 --- GT distribution and insert-size distribution --- p.50 / Chapter 4.5.3 --- Multiread analysis --- p.51 / Chapter 4.5.4 --- Splice-site comparison --- p.52 / Chapter 4.6 --- Discussion and conclusion --- p.55 / Chapter 5 --- Explore relationship between methylation patterns and gene expression --- p.56 / Chapter 5.1 --- Introduction --- p.56 / Chapter 5.2 --- State-of-the-art --- p.58 / Chapter 5.3 --- Problem formulation --- p.62 / Chapter 5.4 --- Methods --- p.62 / Chapter 5.4.1 --- NGS sequencing and analysis --- p.62 / Chapter 5.4.2 --- Data preparation and transformation --- p.64 / Chapter 5.4.3 --- Random forest (RF) classification and regression --- p.65 / Chapter 5.5 --- Results --- p.68 / Chapter 5.5.1 --- Genome wide profiling of methylation --- p.68 / Chapter 5.5.2 --- Aggregation plot of methylation levels at different regions --- p.72 / Chapter 5.5.3 --- Scatterplot between methylation and gene expression --- p.75 / Chapter 5.5.4 --- Predictive model of gene expression using DNA methylation features --- p.76 / Chapter 5.5.5 --- Comb-model based on the full dataset --- p.87 / Chapter 5.6 --- Discussion and conclusion --- p.98 / Chapter 6 --- RNA-Seq data analysis and applications --- p.99 / Chapter 6.1 --- Transcriptional Profiling of Angiogenesis Activities of Calycosin in Zebrafish --- p.99 / Chapter 6.1.1 --- Introduction --- p.99 / Chapter 6.1.2 --- Background --- p.100 / Chapter 6.1.3 --- Materials and methods and ethics statement --- p.101 / Chapter 6.1.4 --- Results --- p.104 / Chapter 6.1.5 --- Conclusion --- p.108 / Chapter 6.2 --- An integrated web medicinal materials DNA database: MMDBD (Medicinal Materials DNA Barcode Database) --- p.110 / Chapter 6.2.1 --- Introduction --- p.110 / Chapter 6.2.2 --- Background --- p.110 / Chapter 6.2.3 --- Construction and content --- p.113 / Chapter 6.2.4 --- Utility and discussion --- p.116 / Chapter 6.2.5 --- Conclusion and future development --- p.119 / Chapter 7 --- Conclusion --- p.121 / Chapter 7.1 --- Conclusion --- p.121 / Chapter 7.2 --- Future work --- p.123 / Appendix --- p.124 / Chapter A1. --- Descriptive analysis of trio data --- p.124 / Chapter A2. --- Whole genome methylation level profiling --- p.125 / Chapter A3. --- Global sliding window correlation between individuals --- p.128 / Chapter A4. --- Features selected after second-run filtering --- p.133 / Bibliography --- p.135 / Chapter A. --- Publications --- p.135 / Reference --- p.135 DNA microarrays--Statistical methods Gene expression--Statistical methods
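The abstract of entry 2 describes choosing the most probable location for a read by maximum likelihood under a geometric-tail (GT) distribution of intron length. The following Python sketch illustrates that idea only under stated assumptions; it is not the thesis's implementation. The assumed form of the distribution (an empirical table `head_pmf` for intron lengths below `threshold`, and a geometric tail with rate `q` at or above it), the function names `gt_log_prob` and `pick_location`, and all numbers are hypothetical.

```python
import math


def gt_log_prob(length, head_pmf, threshold, q):
    """Log-probability of an intron length under an ASSUMED geometric-tail form.

    Below `threshold` the probability is read from the empirical table
    `head_pmf`; at or above it, the leftover probability mass is spread
    over a geometric distribution with rate `q`.
    """
    if length < threshold:
        p = head_pmf.get(length, 0.0)
    else:
        tail_mass = 1.0 - sum(head_pmf.values())
        p = tail_mass * q * (1.0 - q) ** (length - threshold)
    return math.log(p) if p > 0 else float("-inf")


def pick_location(candidates, head_pmf, threshold, q):
    """Return the (location, implied_intron_length) pair with the highest likelihood."""
    return max(candidates, key=lambda c: gt_log_prob(c[1], head_pmf, threshold, q))


# Hypothetical empirical probabilities for short introns (lengths in bp).
head_pmf = {80: 0.3, 90: 0.4, 100: 0.2}
# Hypothetical candidate placements of one multiread, each with the
# intron length that placement would imply.
candidates = [("chr1:1000", 90), ("chr1:5000", 25000), ("chr2:700", 100)]

best = pick_location(candidates, head_pmf, threshold=1000, q=0.001)
print(best)
```

Here the placement implying a 90 bp intron is preferred because that length carries the most probability in the assumed table, while the placement implying a 25,000 bp intron falls far out in the geometric tail.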
3	Bayesian variable selection for high dimensional data analysis. / CUHK electronic theses & dissertations collection January 2010 In the practice of statistical modeling, it is often desirable to have an accurate predictive model. Modern data sets usually have a large number of predictors. For example, DNA microarray gene expression data typically have few observations and a large number of variables. Hence parsimony is an especially important issue. Best-subset selection is a conventional method of variable selection. Because of the large number of variables, the relatively small sample size and severe collinearity among the variables, standard statistical methods for selecting relevant variables often face difficulties. / In the third part of the thesis, we propose a Bayesian stochastic search variable selection approach for multi-class classification, which can identify relevant genes by assessing sets of genes jointly. We consider a multinomial probit model with a generalized g-prior for the regression coefficients. An efficient algorithm using simulation-based MCMC methods is developed for simulating parameters from the posterior distribution. This algorithm is robust to the choice of initial value, and produces posterior probabilities of relevant genes for biological interpretation. We demonstrate the performance of the approach with two well-known gene expression profiling data sets: the leukemia data and the lymphoma data. Compared with other classification approaches, our approach selects smaller numbers of relevant genes and obtains competitive classification accuracy. / The last part of the thesis concerns further research and presents a stochastic variable selection approach with different two-level hierarchical prior distributions. These priors can be used as a sparsity-enforcing mechanism to perform gene selection for classification. 
Using simulation-based MCMC methods for simulating parameters from the posterior distribution, an efficient algorithm can be developed and implemented. / The second part of the thesis proposes a Bayesian stochastic variable selection approach for gene selection based on a probit regression model with a generalized singular g-prior distribution for regression coefficients. Using simulation-based MCMC methods for simulating parameters from the posterior distribution, an efficient and dependable algorithm is implemented. It is also shown that this algorithm is robust to the choice of initial values, and produces posterior probabilities of related genes for biological interpretation. The performance of the proposed approach is compared with other popular methods in gene selection and classification via the well-known colon cancer and leukemia data sets in the microarray literature. / Yang, Aijun. / Adviser: Xin-Yuan Song. / Source: Dissertation Abstracts International, Volume: 72-04, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 89-98). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. Bayesian statistical decision theory Gene expression--Statistical methods Mathematical statistics
4	Population genetic analysis of the black blow fly Phormia regina (Meigen) (Diptera: Calliphoridae) Whale, John W. January 2015 Indiana University-Purdue University Indianapolis (IUPUI) / The black blow fly, Phormia regina (Diptera: Calliphoridae), is a widely abundant fly autochthonous to North America. Like many other Calliphorids, P. regina plays a key role in several disciplines, particularly in estimating post-mortem intervals (PMI). The aim of this work was to better understand the population genetic structure of this ecologically important species using microsatellites from populations collected in the U.S. during 2008 and 2013. Additionally, it sought to determine the effect of limited genetic diversity on a quantitative trait throughout immature development: larval length, a measurement used to estimate specimen age. Observed heterozygosity was lower than expected at five of the six loci and ranged from 0.529 to 0.880, compared with expected heterozygosity that ranged from 0.512 to 0.980; this is indicative of either inbreeding or the presence of null alleles. Kinship coefficients indicate that individuals within each sample are not strongly related to one another; values for the wild-caught populations ranged from 0.033 to 0.171, and a high proportion of the genetic variation (30%) can be found among samples within regions. The population structure of this species does not correlate well with geography; populations differ from one another because of a lack of gene flow irrespective of geographic distance, which suggests that temporal distance plays a greater role in the genetic variation of P. regina. Among colonized samples, flies lost much of their genetic diversity (≥67% of alleles per locus were lost), and population samples became increasingly related; kinship coefficient values increased from 0.036 for the wild-caught individuals to 0.261 among the F10 specimens. 
Colonized larvae also became shorter following repeated inbreeding events: the longest recorded specimen in F1 was 18.75 mm in length, while the longest larva measured in F11 was 1.5 mm shorter at 17.25 mm. This could have major implications in forensic entomology, as the largest specimen is often assumed to be the oldest on the corpse and is subsequently used to estimate a postmortem interval. The reduction in length ultimately resulted in a greater proportion of individuals of a similar length; the range of the data was reduced. Consequently, the major reduction in genetic diversity indicates that the loss in the spread of the larval length distributions may be under genetic influence or control. Therefore, these data highlight how important it is, when undertaking either genetic or developmental studies, particularly of blow flies such as Phormia regina, that specimens and populations be collected not only from more than one geographic location but, more importantly, from more than one temporal event. Life cycles (Biology) -- Genetic aspects Molecular biology -- Mathematical models Gene expression -- Statistical methods Postmortem changes
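Entry 4 compares observed and expected heterozygosity at microsatellite loci. As a rough illustration of how those two quantities are commonly computed for a single locus, here is a minimal Python sketch. The abstract does not say which estimator was used, so the plain gene-diversity formula below (one minus the sum of squared allele frequencies, without a small-sample correction) is an assumption, and the function name `heterozygosity` and the genotypes are hypothetical.

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return (observed, expected) heterozygosity for one diploid locus.

    `genotypes` is a list of (allele_a, allele_b) tuples, one per individual.
    Expected heterozygosity uses the uncorrected form 1 - sum(p_i ** 2),
    which is an assumption about the estimator, not taken from the thesis.
    """
    n = len(genotypes)
    # Observed: fraction of individuals carrying two different alleles.
    observed = sum(1 for a, b in genotypes if a != b) / n
    # Expected: based on allele frequencies pooled over all 2n gene copies.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = 2 * n
    expected = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return observed, expected


# Hypothetical microsatellite genotypes (allele sizes in bp) for four flies.
sample = [(152, 152), (152, 156), (156, 156), (152, 152)]
ho, he = heterozygosity(sample)
print(ho, he)
```

In this toy sample the observed value falls below the expected one, which is the pattern the abstract reports at five of the six loci.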
5	De novo genome assembly of the blow fly Phormia regina (Diptera: Calliphoridae) Andere, Anne A. January 2014 Indiana University-Purdue University Indianapolis (IUPUI) / Phormia regina (Meigen), commonly known as the black blow fly, is a dipteran that belongs to the family Calliphoridae. Calliphorids play an important role in various research fields including ecology, medical studies, and veterinary and forensic sciences. P. regina, a non-model organism, is one of the most common forensically relevant insects in North America and is typically used to assist in estimating postmortem intervals (PMI). To better understand the roles P. regina plays in these research fields, we reconstructed its genome using next-generation sequencing technologies. The focus was on generating a reference genome through de novo assembly of high-throughput short read sequences. Following assembly, genetic markers were identified in the form of microsatellites and single nucleotide polymorphisms (SNPs) to aid in future population genetic surveys of P. regina. A total of 530 million 100 bp paired-end reads were obtained from five pooled male and female P. regina flies using the Illumina HiSeq2000 sequencing platform. A 524 Mbp draft genome with 11,037 predicted genes was assembled using both sexes. The draft reference genome assembled in this study provides an important resource for investigating the genetic diversity that exists between and among blow fly species, and it supports the understanding of their genetic basis in terms of adaptations, population structure and evolution. The genomic tools will facilitate genome-wide studies using modern genomic techniques to refine our understanding of the evolutionary processes underlying genomic evolution between blow flies and other insect species. 
Life cycles (Biology) -- Genetic aspects DNA microarrays -- Statistical methods Gene expression -- Statistical methods Forensic entomology -- Research Genomics -- Technological innovations Molecular biology -- Mathematical models Combinatorial analysis Graph theory Bioinformatics -- Research Postmortem changes
