51 |
Some topics in dimension reduction and clustering. Zhao, Jianhua, 赵建华. January 2009
published_or_final_version / Statistics and Actuarial Science / Doctoral / Doctor of Philosophy
|
52 |
Transcriptome analysis and applications based on next-generation RNA sequencing data. / CUHK electronic theses & dissertations collection. January 2012
The recent development of next-generation RNA sequencing, termed 'RNA-Seq', has offered an opportunity to explore the RNA transcripts of the whole transcriptome. As a revolutionary method, RNA-Seq can not only measure transcript abundances precisely, but also discover novel transcribed content and uncover unknown regulatory mechanisms. Meanwhile, the combination of different levels of next-generation sequencing, such as genome sequencing and methylome sequencing, provides a powerful tool for novel discovery in the biological context. / My PhD study focuses on the analysis of next-generation sequencing (NGS) data, especially RNA-Seq data. It comprises three parts: analysis tool development, data analysis and mechanistic study. / The analysis of massive NGS data is a great challenge for next-generation sequencing technology. Many existing general aligners (in contrast to splice-aware alignment tools) can map millions of sequencing reads onto a reference genome at ultrafast speed. However, they are designed neither for reads that span splice junctions (spliced reads) nor for reads that match multiple locations along the reference genome (multireads). Hence, we have developed ABMapper, an ab initio mapping method using a two-seed strategy. Benchmark results show that ABMapper achieves higher accuracy and recall than comparable tools such as TopHat and SpliceMap. On the other hand, the selection of the most probable location for spliced reads and multireads is another major problem. These reads are randomly assigned to one of the possible locations, or discarded completely, when calculating expression levels, which biases downstream analyses such as differential expression analysis and alternative splicing analysis. To rationally determine the location of spliced reads and multireads, we have proposed a maximum likelihood estimation method based on a geometric-tail (GT) distribution of intron length.
This probabilistic model covers splice junctions that lie within one or both reads of a pair-ended (PE) pair, as well as junctions that fall between the two reads. Based on this model, multiple alignments of reads within a PE pair can be properly resolved. / The accumulation of NGS data has provided rich resources for deep discovery of biological significance. We have integrated RNA-Seq data and methylation sequencing data to build a predictive model of gene expression based on DNA methylation patterns. We found that DNA methylation predicts gene expression fairly accurately, with accuracy reaching 78%. We also found that DNA methylation in the gene body is the most informative region in these models, even more useful than the promoter. Finally, a feature-overlap network built on an optimal subset, selected from the combined set of all methylation patterns and CpG patterns, indicates the collaborative regulation of gene expression by DNA methylation patterns. / Not only were new algorithms developed to facilitate RNA-Seq data analysis, but a transcriptome analysis was also performed on zebrafish. The analysis of differentially expressed genes and pathways involved after calycosin treatment, combined with other experimental evidence such as fluorescence microscopy and quantitative real-time polymerase chain reaction (qPCR), demonstrates the proangiogenic effects of calycosin in vivo. / In summary, this thesis details my work on NGS data analysis, discovery of biological significance using data-mining algorithms, and transcriptome analysis. / Lou, Shaoke. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 135-146). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
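As an illustration of the GT idea, the minimal sketch below scores candidate alignments of an ambiguous read by the likelihood of the intron length each candidate implies: an empirical distribution for short introns and a geometric tail beyond a threshold. The function names, parameter values and toy empirical distribution are assumptions for exposition, not the thesis's actual implementation.

```python
import numpy as np

def gt_log_prob(intron_len, emp_probs, tail_start, tail_p):
    """Log-probability of an intron length under a geometric-tail (GT) model:
    an empirical distribution below `tail_start`, a geometric tail beyond it.
    All parameter names and values here are illustrative."""
    if intron_len < tail_start:
        return np.log(emp_probs.get(intron_len, 1e-12))
    # Geometric tail: remaining mass decays by (1 - tail_p) per extra base.
    extra = intron_len - tail_start
    tail_mass = 1.0 - sum(p for l, p in emp_probs.items() if l < tail_start)
    return np.log(tail_mass * tail_p) + extra * np.log(1.0 - tail_p)

def pick_location(cand_intron_lens, emp_probs, tail_start=50_000, tail_p=1e-4):
    """Choose the most likely mapping among candidate locations of a
    spliced read/multiread, each candidate implying an intron length."""
    scores = [gt_log_prob(c, emp_probs, tail_start, tail_p)
              for c in cand_intron_lens]
    return int(np.argmax(scores))

# Toy usage: three candidate alignments implying different intron lengths.
emp = {100: 0.3, 500: 0.5, 2_000: 0.199}  # toy empirical intron-length probabilities
print(pick_location([500, 2_000, 120_000], emp))  # favours the 500 bp intron
```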
/ 摘要 / Acknowledgement / Chapter 1 Introduction: 1.1 Bioinformatics; 1.2 Bioinformatics application; 1.3 Motivation; 1.4 Objectives; 1.5 Thesis outline / Chapter 2 Background: 2.1 Biological and biotechnology background (2.1.1 Central dogma and biology ABC; 2.1.2 Transcription; 2.1.3 Splicing and Alternative Splicing; 2.1.4 Next-generation Sequencing; 2.1.5 RNA-Seq); 2.2 Computational background (2.2.1 Approximate string matching and read mapping; 2.2.2 Read mapping algorithms and tools; 2.2.3 Spliced alignment tools) / Chapter 3 ABMapper: a two-seed based spliced alignment tool: 3.1 Introduction; 3.2 State-of-the-art; 3.3 Problem formulation; 3.4 Methods; 3.5 Results (3.5.1 Benchmark test; 3.5.2 Complexity analysis; 3.5.3 Comparison with other tools); 3.6 Discussion and conclusion / Chapter 4 Geometric-tail (GT) model for rational selection of RNA-Seq read location: 4.1 Introduction; 4.2 State-of-the-art; 4.3 Problem formulation; 4.4 Algorithms; 4.5 Results (4.5.1 Workflow of GT MLE method; 4.5.2 GT distribution and insert-size distribution; 4.5.3 Multiread analysis; 4.5.4 Splice-site comparison); 4.6 Discussion and conclusion / Chapter 5 Explore relationship between methylation patterns and gene expression: 5.1 Introduction; 5.2 State-of-the-art; 5.3 Problem formulation; 5.4 Methods (5.4.1 NGS sequencing and analysis; 5.4.2 Data preparation and transformation; 5.4.3 Random forest (RF) classification and regression); 5.5 Results (5.5.1 Genome-wide profiling of methylation; 5.5.2 Aggregation plot of methylation levels at different regions; 5.5.3 Scatterplot between methylation and gene expression; 5.5.4 Predictive model of gene expression using DNA methylation features; 5.5.5 Comb-model based on the full dataset); 5.6 Discussion and conclusion / Chapter 6 RNA-Seq data analysis and applications: 6.1 Transcriptional Profiling of Angiogenesis Activities of Calycosin in Zebrafish (6.1.1 Introduction; 6.1.2 Background; 6.1.3 Materials and methods and ethics statement; 6.1.4 Results; 6.1.5 Conclusion); 6.2 An integrated web medicinal materials DNA database: MMDBD (Medicinal Materials DNA Barcode Database) (6.2.1 Introduction; 6.2.2 Background; 6.2.3 Construction and content; 6.2.4 Utility and discussion; 6.2.5 Conclusion and future development) / Chapter 7 Conclusion: 7.1 Conclusion; 7.2 Future work / Appendix: A1. Descriptive analysis of trio data; A2. Whole genome methylation level profiling; A3. Global sliding window correlation between individuals; A4. Features selected after second-run filtering / Bibliography / Publications / Reference
|
53 |
Simulation for tests on the validity of the assumption that the underlying distribution of life is exponential. Thoppil, Anjo. January 2010
Typescript (photocopy). / Digitized by Kansas Correctional Industries
|
54 |
Detection of parent-of-origin effects and association in relation to a quantitative trait. He, Feng, 贺峰. January 2010
published_or_final_version / Statistics and Actuarial Science / Master / Master of Philosophy
|
55 |
On the evaluation and statistical analysis of forensic evidence in DNA mixtures. Chung, Yuk-ka, 鍾玉嘉. January 2011
published_or_final_version / Statistics and Actuarial Science / Doctoral / Doctor of Philosophy
|
56 |
Mining optimal technical trading rules with genetic algorithms. Shen, Rujun, 沈汝君. January 2011
In recent years technical trading rules have become widely known; not only academics but also many investors apply them in financial markets. One approach to constructing technical trading rules is to use technical indicators, such as moving average (MA) and filter rules. These trading rules are widely used, possibly because the technical indicators are simple to compute and can be programmed easily. An alternative approach to constructing technical trading rules is to rely on chart patterns. However, the patterns and signals detected by these rules are often identified by visual inspection with the human eye. As far as I know, there are no universally accepted methods of constructing chart patterns. In 2000, Prof. Andrew Lo and his colleagues were the first to define five pairs of chart patterns mathematically: Head-and-Shoulders (HS) & Inverted Head-and-Shoulders (IHS), Broadening tops (BTOP) & bottoms (BBOT), Triangle tops (TTOP) & bottoms (TBOT), Rectangle tops (RTOP) & bottoms (RBOT), and Double tops (DTOP) & bottoms (DBOT).
The basic formulation of a chart pattern consists of two steps: detection of (i) the extreme points of a price series; and (ii) the shape of the pattern. In Lo et al. (2000), kernel smoothing was used to identify the extreme points. Lo et al. (2000) admitted that the optimal bandwidth used in the kernel method is not the best choice and that expert judgement is needed in selecting the bandwidth. In addition, their work considered chart-pattern detection only, with no buy/sell signal detection. It should be noted that a chart pattern can form without a signal being detected, in which case no transaction will be made. In this thesis, I propose a new class of technical trading rules that aims to resolve the above problems. More specifically, each chart pattern is parameterized by a set of parameters that governs the shape of the pattern and the entry and exit signals of trades. The optimal set of parameters can then be determined using genetic algorithms (GAs), as in the sketch below. The advantage of GAs is that they can deal with high-dimensional optimization problems whether the parameters to be optimized are continuous or discrete. In addition, GAs are also convenient to use when the fitness function is not differentiable or has a multi-modal surface. / published_or_final_version / Statistics and Actuarial Science / Master / Master of Philosophy
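The following is a minimal real-coded genetic algorithm with tournament selection, uniform crossover and Gaussian mutation, sketching how such a parameterized rule could be optimized. The fitness function here is a toy placeholder; in practice it would be the backtested performance of the pattern/signal parameters, and all operator settings are illustrative assumptions rather than the thesis's actual configuration.

```python
import random

def evolve(fitness, bounds, pop_size=50, generations=100,
           crossover_rate=0.8, mutation_rate=0.1):
    """Minimal real-coded GA: tournament selection, uniform crossover,
    Gaussian mutation. `fitness` maps a parameter vector to a score to
    maximise (e.g. backtested return of a parameterised chart pattern)."""
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        def tournament():
            a, b = random.sample(pop, 2)
            return a if fitness(a) > fitness(b) else b
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if random.random() < crossover_rate:   # uniform crossover
                child = [x if random.random() < 0.5 else y for x, y in zip(p1, p2)]
            else:
                child = p1[:]
            for i, (lo, hi) in enumerate(bounds):  # clipped Gaussian mutation
                if random.random() < mutation_rate:
                    child[i] = min(hi, max(lo, child[i] + random.gauss(0, 0.1 * (hi - lo))))
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# Toy fitness standing in for the backtested profit of a pattern/signal rule.
best = evolve(lambda p: -sum((x - 0.3) ** 2 for x in p), bounds=[(0, 1)] * 4)
```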
|
57 |
Feature-based 2D-3D registration and 3D reconstruction from a limited number of images via statistical inference for image-guided interventions. Kang, Xin, 康欣. January 2011
Traditional open interventions have been progressively replaced with minimally invasive techniques. Most notably, direct visual feedback has transitioned into indirect, image-based feedback, leading to the wide use of image-guided interventions (IGIs). One essential process in all IGIs is to align some 3D data with 2D images of the patient through a procedure called 3D-2D registration during interventions, to provide better guidance and richer information. When the 3D data is unavailable, a realistic 3D patient-specific model needs to be constructed from a few 2D images.
The dominant methods, which use only image intensity, have a narrow convergence range and are not robust to foreign objects present in the 2D images but absent from the 3D data. Feature-based methods partly address these problems, but most of them rely heavily on a set of "best" paired correspondences and require clean image features. Moreover, the optimization procedures used in both kinds of methods are not efficient.
In this dissertation, two topics have been studied and novel algorithms proposed, namely, contour extraction from X-ray images and feature-based rigid/deformable 3D-2D registration.
Inspired by biological and neuropsychological characteristics of the primary visual cortex (V1), a contour detector is proposed for simultaneously extracting edges and lines in images. The synergy of V1 neurons is mimicked using phase congruency and tensor voting. Evaluations and comparisons showed that the proposed method outperformed several commonly used methods and that the results are consistent with human perception. Moreover, the cumbersome "fine-tuning" of parameter values is not always necessary in the proposed method.
An extensible feature-based 3D-2D registration framework is proposed by rigorously formulating the registration as a probability density estimation problem and solving it via a generalized expectation-maximization algorithm. It optimizes the transformation directly and treats correspondences as nuisance parameters. This is significantly different from almost all feature-based methods in the literature, which first single out a set of "best" correspondences and then estimate a transformation associated with them. This property means the proposed algorithm does not rely on paired correspondences and is thus inherently robust to outliers. The framework can be adapted as a point-based method with the major advantages of 1) independence from paired correspondences, 2) accurate registration using a single image, and 3) robustness to the initialization and a large number of outliers. Extended to a contour-based method, it differs from other contour-based methods mainly in that 1) it does not rely on correspondences and 2) it incorporates gradient information via a statistical model instead of a weighting function. Turning to model-based deformable registration and surface reconstruction, our method solves the problem using maximum penalized likelihood estimation. Unlike almost all other methods, which handle the registration and deformation separately and optimize them sequentially, our method optimizes them simultaneously. The framework was evaluated in two example clinical applications and a simulation study for point-based registration, contour-based registration and surface reconstruction, respectively. Experiments showed its sub-degree and sub-millimeter registration accuracy and superiority to the state-of-the-art methods.
It is expected that our algorithms, when thoroughly validated, can be used as valuable tools for image-guided interventions. / published_or_final_version / Orthopaedics and Traumatology / Doctoral / Doctor of Philosophy
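To make the "correspondences as nuisance parameters" idea concrete, here is a toy 2D point-set analogue (close in spirit to coherent point drift rather than the dissertation's actual 3D-2D algorithm): the E-step computes soft correspondence weights instead of hard-picking matches, and the M-step re-estimates the rigid transform by Procrustes analysis on the soft targets. All parameter values are illustrative, and the Gaussian width is held fixed where a full GEM treatment would update it.

```python
import numpy as np

def soft_rigid_register(X, Y, n_iter=50, sigma2=0.5):
    """Toy EM-style rigid 2D registration of moving set Y onto fixed set X.
    Correspondences stay latent (soft posterior weights); R, t come from
    weighted Procrustes each iteration."""
    R, t = np.eye(2), np.zeros(2)
    for _ in range(n_iter):
        Yt = Y @ R.T + t
        d2 = ((Yt[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # (M, N) distances
        P = np.exp(-d2 / (2 * sigma2))
        P /= P.sum(axis=1, keepdims=True) + 1e-12              # E-step: soft matches
        T = P @ X                          # soft target for each moving point
        mu_y, mu_t = Y.mean(0), T.mean(0)
        H = (Y - mu_y).T @ (T - mu_t)      # 2x2 cross-covariance
        U, _, Vt = np.linalg.svd(H)        # M-step: Procrustes rotation
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, d]) @ U.T
        t = mu_t - R @ mu_y
    return R, t

# Toy usage: recover a small rotation despite one gross outlier point.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
theta = 0.3
R0 = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
Y = np.vstack([X @ R0.T, [[5.0, 5.0]]])    # rotated copy plus an outlier
R_hat, t_hat = soft_rigid_register(X, Y)   # R_hat should be close to R0.T
```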
|
58 |
Statistical process control charts with known and estimated parameters. Yang, Hualong, 阳华龙. January 2013
Monitoring and detection of abrupt changes in multivariate processes are becoming increasingly important in modern manufacturing environments, where typical equipment may have multiple key variables to be measured continuously. Hotelling's T² and CUSUM charts have been widely applied to the problem of monitoring the mean vector of multivariate quality measurements. In addition, a new multivariate cumulative sum (MCUSUM) chart is introduced, in which the target mean shift is assumed to be a weighted sum of the principal directions of the population covariance matrix. In practical problems, parameters must be estimated, and the properties of control charts then differ from the case where the parameters are known in advance. In particular, it has been observed that the average run length (ARL), a performance indicator of control charts, is larger when estimated parameters are used. As a first contribution, we provide a general and formal proof of this phenomenon. Also, to design an efficient T² or CUSUM chart with estimated parameters, a method to calculate or approximate the ARL function is needed. A commonly used approach consists of tabulating reference values using extensive Monte Carlo simulation. This thesis takes a different approach and provides an analytical approximation of the ARL function in the univariate case, especially the in-control ARL function, which allows control limits to be set directly for different Phase I sample sizes instead of through complex simulation. / published_or_final_version / Statistics and Actuarial Science / Master / Master of Philosophy
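The ARL inflation under estimated parameters can be illustrated with a small simulation, sketched below under assumed settings (dimension 2, Phase I sample size 50, a chi-square control limit for a nominal in-control ARL near 200). This is a toy illustration of the phenomenon, not the thesis's analytical approximation.

```python
import numpy as np

def run_length_T2(limit, mu_hat, S_inv, rng, p=2, max_n=10_000):
    """Observations until the Hotelling T^2 statistic, computed with the
    supplied (true or Phase-I-estimated) mean and inverse covariance,
    first exceeds `limit`. The process itself is in control: N(0, I)."""
    X = rng.standard_normal((max_n, p))          # in-control observations
    D = X - mu_hat
    t2 = np.einsum('ij,jk,ik->i', D, S_inv, D)   # T^2 for every observation
    hits = np.nonzero(t2 > limit)[0]
    return hits[0] + 1 if hits.size else max_n

rng = np.random.default_rng(1)
p, m, reps = 2, 50, 500
limit = 10.6   # approx. chi-square(2) quantile giving in-control ARL ~ 200
known = [run_length_T2(limit, np.zeros(p), np.eye(p), rng) for _ in range(reps)]
est = []
for _ in range(reps):
    phase1 = rng.standard_normal((m, p))         # Phase I estimation sample
    est.append(run_length_T2(limit, phase1.mean(0),
                             np.linalg.inv(np.cov(phase1.T)), rng))
print(np.mean(known), np.mean(est))  # estimated parameters inflate the ARL
```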
|
59 |
Some topics on statistical analysis of genetic imprinting data and microbiome compositional data. Xia, Fan, 夏凡. January 2014
Genetic association studies are a useful tool for identifying the genetic component responsible for a disease. The phenomenon whereby a certain gene is expressed in a parent-of-origin manner is referred to as genomic imprinting. When a gene is imprinted, the performance of a disease-association study will be affected. This thesis presents statistical testing methods developed specifically for nuclear family data, centering on genetic association studies that incorporate imprinting effects. For qualitative diseases with binary outcomes, a class of TDTI*-type tests is proposed in a general two-stage framework, in which imprinting effects are examined prior to association testing. For quantitative trait loci, a class of Q-TDTI(c)-type tests and another class of Q-MAX(c)-type tests are proposed. The proposed testing methods flexibly accommodate families with missing parental genotypes and with multiple siblings. The performance of all the methods was verified by simulation studies, which found that the proposed methods improve the testing power for detecting association in the presence of imprinting. The class of TDTI* tests was applied to rheumatoid arthritis study data, and the class of Q-TDTI(c) tests was applied to the Framingham Heart Study data.
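For context, the TDTI*-type tests build on the classic transmission disequilibrium test (TDT). A minimal sketch of the classic McNemar-form TDT follows; it is background only, not the thesis's imprinting-aware variants, and the counts are toy values.

```python
from scipy.stats import chi2

def tdt(b, c):
    """Classic transmission disequilibrium test: b and c are the counts of
    heterozygous parents transmitting vs. not transmitting the candidate
    allele to an affected child; the statistic is chi-square(1) under H0."""
    stat = (b - c) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

stat, pval = tdt(b=48, c=30)   # toy transmission counts
```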
The human microbiome is the collection of microbiota, together with their genomes and their habitats, throughout the human body. The human microbiome comprises an inalienable part of our genetic landscape and contributes to our metabolic features. Current studies have also suggested that the human microbiome varies across human diseases. With high-throughput DNA sequencing, human microbiome composition can be characterized through the relative abundances of bacterial taxa and the phylogenetic constraint. Such taxa data are often high-dimensional, overdispersed, and contain an excessive number of zeros. Taking these characteristics of taxa data into account, this thesis presents statistical methods to identify associations between covariates/outcomes and the human microbiome composition. To assess environmental/biological covariate effects on microbiome composition, an additive logistic normal multinomial regression model is proposed, and a group-l1 penalized likelihood estimation method is further developed to facilitate the selection of covariates and the estimation of parameters. To identify microbiome components associated with biological/clinical outcomes, a Bayesian hierarchical regression model with a spike-and-slab prior for variable selection is proposed, and a Markov chain Monte Carlo algorithm combining a stochastic variable selection procedure with random-walk Metropolis-Hastings steps is developed for model estimation. Both methods are illustrated using simulations as well as a real human gut microbiome dataset from the Penn Gut Microbiome Project. / published_or_final_version / Statistics and Actuarial Science / Doctoral / Doctor of Philosophy
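As a rough illustration of working with zero-inflated compositional taxa data, the sketch below uses a centred log-ratio transform followed by l1-penalized regression. This is a simple frequentist stand-in, not the thesis's methods (the logistic normal multinomial model and the spike-and-slab Bayesian selection are considerably richer); the simulated data, pseudocount and tuning value are all illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def clr(counts, pseudo=0.5):
    """Centred log-ratio transform of taxa counts: a pseudocount handles
    the many zeros, counts become proportions, and the log-ratio against
    the geometric mean moves features into unconstrained space."""
    comp = counts + pseudo
    comp = comp / comp.sum(axis=1, keepdims=True)
    logc = np.log(comp)
    return logc - logc.mean(axis=1, keepdims=True)

# Toy usage: 100 samples, 40 taxa, outcome driven by two taxa.
rng = np.random.default_rng(2)
counts = rng.poisson(20, size=(100, 40))
Z = clr(counts)
y = 2.0 * Z[:, 3] - 1.5 * Z[:, 17] + rng.normal(0, 0.5, 100)
model = Lasso(alpha=0.05).fit(Z, y)
selected = np.nonzero(model.coef_)[0]   # taxa retained by the l1 penalty
```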
|
60 |
The importance of lower-bound capacities in geotechnical reliability assessments. Najjar, Shadi Sam. 28 August 2008
Not available / text
|