1 |
Efficient methods for improving the sensitivity and accuracy of RNA alignments and structure prediction
Li, Yaoman, 李耀满. January 2013 (has links)
RNA plays an important role in molecular biology, and RNA sequence comparison is an important method for analyzing gene expression. Because aligning RNA reads must handle gaps, mutations, poly-A tails, etc., it is much more difficult than aligning other sequences. In this thesis, we study RNA-Seq alignment tools and existing gene information databases, and investigate how to improve alignment accuracy and predict RNA secondary structure.
The known gene information database contains a large amount of reliable, previously discovered gene information. We also note that most DNA alignment tools are well developed: they run much faster than existing RNA-Seq alignment tools and have higher sensitivity and accuracy. Combining them with the known gene information database, we present a method to align RNA-Seq data using DNA alignment tools; that is, we use a DNA alignment tool to perform the alignment and then use the gene information to convert the alignment to genome-based coordinates.
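The coordinate conversion described above can be illustrated with a minimal sketch. The exon structure and function names below are hypothetical, not taken from the thesis: given a transcript's exons as genome intervals, a position in transcript coordinates is walked through the exons until it falls inside one.

```python
def transcript_to_genome(exons, t_pos):
    """Map a 0-based transcript coordinate to a genome coordinate.

    exons: list of (genome_start, genome_end) pairs, 0-based,
    end-exclusive, ordered along the transcript.
    """
    offset = t_pos
    for start, end in exons:
        length = end - start
        if offset < length:        # position lies inside this exon
            return start + offset
        offset -= length           # skip the whole exon and continue
    raise ValueError("position beyond transcript length")

# A toy transcript with two exons: genome [100,110) and [200,205).
exons = [(100, 110), (200, 205)]
assert transcript_to_genome(exons, 3) == 103    # inside exon 1
assert transcript_to_genome(exons, 12) == 202   # spliced into exon 2
```

A real converter would also carry strand, chromosome, and CIGAR information, but the per-exon offset walk is the core of mapping a transcript-based alignment back onto the genome.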
Although the gene information database is updated daily, many genes and alternative splicings have not yet been discovered. If our RNA alignment tool relied only on the known gene database, many reads originating from unknown genes or alternative splicings could not be aligned. Thus, we present a combinational method that can cover potential alternative splicing junction sites. Combined with the original gene database, the new alignment tool can cover most of the alignments reported by other RNA-Seq alignment tools.
Recently, many RNA-Seq alignment tools have been developed; they are more powerful and faster than the previous generation of tools. However, RNA read alignment is much more complicated than other sequence alignment, and the alignments reported by some RNA-Seq alignment tools have low accuracy. We present a simple and efficient filtering method, based on the quality scores of the reads, that removes most low-accuracy alignments.
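A quality-score filter of this general kind can be sketched as follows; the threshold of 20 and the function names are illustrative assumptions, not the thesis's actual parameters.

```python
def mean_phred(qual, offset=33):
    """Average Phred quality of a FASTQ quality string (Phred+33 encoding)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def keep_alignment(qual, threshold=20):
    """Keep an alignment only if its read's average base quality is high."""
    return mean_phred(qual) >= threshold

assert keep_alignment("IIIIIIII")        # 'I' = Phred 40: kept
assert not keep_alignment("########")    # '#' = Phred 2: filtered out
```

In practice the quality string comes from the FASTQ/SAM record of each aligned read, and reads failing the threshold have their alignments discarded.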
Finally, we present an RNA secondary structure prediction method that can predict pseudoknots (a type of RNA secondary structure) with high sensitivity and specificity. / published_or_final_version / Computer Science / Master / Master of Philosophy
|
2 |
Binning and annotation for metagenomic next-generation sequencing reads
Wang, Yi, 王毅. January 2014 (has links)
The development of next-generation sequencing technology enables us to obtain a vast number of short reads from metagenomic samples, in which reads from different species are mixed together. Metagenomic binning has therefore been introduced to cluster reads from the same or closely related species, and metagenomic annotation to predict the taxonomic information of each read. Both binning and annotation are critical steps in downstream analysis. This thesis discusses the difficulties of these two computational problems and proposes two algorithmic methods, MetaCluster 5.0 and MetaAnnotator, as solutions.
There are six major challenges in metagenomic binning: (1) the lack of reference genomes; (2) uneven abundance ratios; (3) short read lengths; (4) a large number of species; (5) the existence of species with extremely low abundance; and (6) recovering low-abundance species. To solve these problems, I propose a two-round binning method, MetaCluster 5.0. The improvement achieved by MetaCluster 5.0 is based on three major observations. First, the short q-mer (length-q substrings with q = 4, 5) frequency distributions of individual sufficiently long fragments sampled from the same genome are more similar than those of fragments sampled from different genomes. Second, sufficiently long w-mers (length-w substrings with w ≈ 30) are usually unique within each individual genome. Third, the k-mer (length-k substrings with k ≈ 16) frequencies from the reads of a species are usually linearly proportional to the species' abundance.
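The first observation, comparing short q-mer frequency distributions, can be sketched as follows. This is a generic illustration under assumed toy sequences, not MetaCluster's actual distance measure.

```python
from itertools import product

def qmer_freq(seq, q=4):
    """Normalized q-mer frequency vector of a DNA fragment."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=q)}
    total = 0
    for i in range(len(seq) - q + 1):
        w = seq[i:i + q]
        if w in counts:            # skip q-mers containing N etc.
            counts[w] += 1
            total += 1
    return [c / total for c in counts.values()]

def l1_dist(a, b):
    """L1 distance between two frequency vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Fragments from the same (AT-rich) toy genome are closer to each other
# than to a fragment from a different (GC-rich) toy genome.
f_same1 = qmer_freq("ATATATATATAT" * 8)
f_same2 = qmer_freq("TATATATATATA" * 8)
f_diff  = qmer_freq("GCGCGCGCGCGC" * 8)
assert l1_dist(f_same1, f_same2) < l1_dist(f_same1, f_diff)
```

With q = 4 the vector has 256 entries; clustering such vectors groups fragments that likely come from the same genome even without any reference.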
The metagenomic annotation methods in the literature often suffer from five major drawbacks: (1) inability to annotate many reads; (2) less precise annotation for reads and more incorrect annotation for contigs; (3) inability to handle novel clades with limited reference genomes well; (4) performance affected by variable genome sequence similarities between clades; and (5) high time complexity. In this thesis, a novel tool, MetaAnnotator, is proposed to tackle these problems. MetaAnnotator makes four major contributions. First, instead of annotating reads/contigs independently, a cluster of reads/contigs is annotated as a whole. Second, multiple reference databases are integrated. Third, for each individual clade, quadratic discriminant analysis is applied to capture the similarities between reference sequences in the clade. Fourth, instead of using alignment tools, MetaAnnotator performs annotation using exact k-mer matching, which is more efficient.
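The idea of annotating by exact k-mer match rather than alignment can be sketched with a toy voting scheme. The clade names, k value, and voting rule here are illustrative assumptions, not MetaAnnotator's actual design.

```python
def build_index(references, k=8):
    """Map each k-mer to the set of clades whose references contain it."""
    index = {}
    for clade, seq in references.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(clade)
    return index

def annotate(read, index, k=8):
    """Assign the read to the clade sharing the most exact k-mers with it."""
    votes = {}
    for i in range(len(read) - k + 1):
        for clade in index.get(read[i:i + k], ()):
            votes[clade] = votes.get(clade, 0) + 1
    return max(votes, key=votes.get) if votes else None

refs = {"cladeA": "ACGT" * 5, "cladeB": "TTTTGGGGCCCCAAAATTTT"}
index = build_index(refs)
assert annotate("ACGTACGTACGT", index) == "cladeA"
assert annotate("NNNNNNNNNN", index) is None   # no k-mer hit: unannotated
```

Exact hash lookups like this are much cheaper than alignment, which is the efficiency argument made above.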
Experiments on both simulated and real datasets show that MetaCluster 5.0 and MetaAnnotator outperform existing tools, achieving higher accuracy at lower time and space cost. / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
|
3 |
Deciphering the mechanisms of genetic disorders by high throughput genomic data
Bao, Suying, 鲍素莹. January 2013 (has links)
A new generation of non-Sanger-based sequencing technologies, so-called "next-generation" sequencing (NGS), has been changing the landscape of genetics at unprecedented speed. In particular, our capacity to decipher the genotypes underlying phenotypes, such as diseases, has never been greater. However, before NGS can be fully applied in medical genetics, researchers must bridge the widening gap between the generation of massively parallel sequencing output and the capacity to analyze the resulting data. In addition, even when an effective NGS analysis yields a list of candidate genes with potential causal variants, pinpointing the disease genes within that long list remains a challenge. The issue becomes especially difficult when the molecular basis of the disease is not fully elucidated.
New NGS users are often bewildered by a plethora of options in mapping, assembly, variant calling and filtering programs, and may have no idea how to compare these tools and choose the "right" ones. To get an overview of various bioinformatics approaches to mapping and assembly, a series of performance evaluations was conducted using both real and simulated NGS short reads. For NGS variant detection, the performance of the two most widely used toolkits, SAMtools and GATK, was assessed. Based on the results of this systematic evaluation, an NGS data processing and analysis pipeline was constructed. The pipeline proved successful with the identification of a mutation (a frameshift deletion in Hnrnpa1, p.Leu181Valfs*6) related to congenital heart defect (CHD) in procollagen type IIA deficient mice.
In order to prioritize risk genes for diseases, especially those with limited prior knowledge, a network-based gene prioritization model was constructed. It consists of two parts: network analysis on known disease genes (seed-based network strategy) and network analysis on differential expression (DE-based network strategy). Case studies of various complex diseases/traits demonstrated that the DE-based network strategy can greatly outperform traditional gene expression analysis in predicting disease-causing genes. A series of simulations indicated that the DE-based strategy is especially valuable for diseases with limited prior knowledge, and that the model's performance can be further improved by integrating it with the seed-based network strategy. Moreover, a successful application of the network-based gene prioritization model in an influenza host genetic study further demonstrated the model's capacity to identify promising candidates and to mine new risk genes and pathways not biased toward current knowledge.
In conclusion, an efficient NGS analysis framework, from quality control and variant detection to result analysis and gene prioritization, has been constructed for medical genetics. The novelty of this framework lies in its attempt to prioritize risk genes for poorly characterized diseases through network analysis of known disease genes and differential expression data. The successful applications in detecting genetic factors associated with CHD and influenza host resistance demonstrate the efficacy of this framework, and may further stimulate applications of high throughput genomic data in dissecting the genetic components of human disorders in the near future. / published_or_final_version / Biochemistry / Doctoral / Doctor of Philosophy
|
4 |
Motif discovery for DNA sequences
Leung, Chi-ming, 梁志銘. January 2006 (has links)
published_or_final_version / abstract / Computer Science / Doctoral / Doctor of Philosophy
|
5 |
Iterative de Bruijn graph assemblers for second-generation sequencing reads
Peng, Yu, 彭煜. January 2012 (has links)
The recent advance of second-generation sequencing technologies has made it possible to generate a vast number of short reads from a DNA (cDNA) sample. Current short read assemblers produce contigs using the de Bruijn graph, in which each vertex is a k-mer and each edge connecting vertex u and vertex v represents u and v appearing consecutively in a read. There are three major problems for de Bruijn graph assemblers: (1) the branch problem, due to errors and repeats; (2) the gap problem, due to low or uneven sequencing depth; and (3) the error problem, due to sequencing errors. The choice of k is a crucial tradeoff in de Bruijn graph assemblers: a low k value leads to fewer gaps but more branches; a high k value leads to fewer branches but more gaps.
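The graph definition above can be sketched directly; this toy builder (names assumed for illustration) records an edge whenever two k-mers overlap by k-1 characters inside a read.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are k-mers; an edge joins two
    k-mers that appear consecutively (overlapping by k-1) in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

# Two overlapping reads yield one simple path through the graph.
g = de_bruijn(["ACGTG", "CGTGA"], k=3)
assert g["ACG"] == {"CGT"}
assert g["CGT"] == {"GTG"}
assert g["GTG"] == {"TGA"}
```

With a smaller k, distinct genomic positions are more likely to share a k-mer and merge into one vertex (more branches, fewer gaps); with a larger k the opposite holds, which is exactly the tradeoff described above.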
In this thesis, I first analyze the fundamental genome assembly problem and then propose an iterative de Bruijn graph assembler (IDBA), which iterates from low to high k values to construct a de Bruijn graph with fewer branches and fewer gaps than any de Bruijn graph assembler using a fixed k value. Second-generation sequencing data from metagenomic, single-cell and transcriptome samples are then investigated, and IDBA is tailored with special treatments to handle the specific issues of each kind of data.
For metagenomic sequencing data, a graph partition algorithm is proposed to separate the de Bruijn graph into dense components, which represent similar regions in subspecies of the same species; multiple sequence alignment is then used to produce a consensus for each component. For sequencing data with highly uneven depth, such as single-cell and metagenomic data, a method called local assembly is designed to reconstruct missing k-mers in low-depth regions. Then, based on the observation that short and relatively low-depth contigs are more likely erroneous, progressive depth thresholds on contigs are used to remove errors in both low-depth and high-depth regions iteratively. For transcriptome sequencing data, a variant of the progressive depth method is adopted to decompose the de Bruijn graph into components corresponding to transcripts of the same gene, and the transcripts are then found in each component by considering read and paired-end read support.
Extensive experiments on both simulated and real data show that the IDBA assemblers outperform existing assemblers, constructing longer contigs with higher completeness and similar or better accuracy. The running time of the IDBA assemblers is comparable to that of existing algorithms, while their memory cost is usually lower. / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
|
6 |
Data mining of post genome-wide association studies and next generation sequencing
Gui, Hongsheng, 桂宏胜. January 2013 (has links)
abstract / Psychiatry / Doctoral / Doctor of Philosophy
|
7 |
Predicting functional impact of nonsynonymous mutations by quantifying conservation information and detect indels using split-read approach
Zeng, Shuai, 曾帥. January 2014 (has links)
Rapidly developing sequencing technology has given scientists the opportunity to examine detailed genotype information in the human genome. Computational programs have played important roles in identifying disease-related genomic variants from huge amounts of sequencing data.
In recent years, a number of computational algorithms have been developed to solve crucial problems in sequencing data analysis, such as mapping sequencing reads to the genome and identifying SNPs. However, many difficult and important issues still await satisfactory solutions. A key challenge is identifying disease-related mutations against the background of non-pathogenic polymorphisms. Another crucial problem is detecting indels, especially long deletions, given the technical limitations of second-generation sequencing technology.
To predict disease-related mutations, we developed a machine learning-based (random forest) prediction tool, EFIN (Evaluation of Functional Impact of Nonsynonymous mutations). We build a multiple sequence alignment (MSA) for a query protein with its homologous sequences, and divide the MSA into blocks according to the taxonomic information of the sequences. We then quantify the conservation in each block using a number of selected features, for example entropy, a concept borrowed from information theory. EFIN was trained on the Swiss-Prot and HumDiv datasets. In a series of fair comparisons, EFIN achieved better results than widely used algorithms in terms of AUC (area under the ROC curve), accuracy, specificity and sensitivity. A web-based database is provided to worldwide users at paed.hku.hk/efin.
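The entropy feature mentioned above can be sketched for a single alignment column; this is a generic Shannon-entropy illustration, not EFIN's exact feature definition.

```python
import math

def column_entropy(column):
    """Shannon entropy (bits) of one MSA column, given as a string of
    residues; low entropy means the position is conserved."""
    counts = {}
    for aa in column:
        counts[aa] = counts.get(aa, 0) + 1
    n = len(column)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

assert column_entropy("LLLLLLLL") == 0.0           # fully conserved column
assert abs(column_entropy("LLVV") - 1.0) < 1e-9    # 50/50 split: 1 bit
```

A column where a disease-causing substitution occurs in an otherwise low-entropy (highly conserved) block is a strong signal that the mutation is damaging.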
To solve the second problem, we developed a Linux-based software tool, SPLindel, that detects deletions (especially long deletions) and insertions using second-generation sequencing data. For each sample, SPLindel uses a split-read method to detect candidate indels, building alternative references to go along with the reference sequences. We then remap all the relevant reads using both the original reference and the alternative-allele references. A Bayesian model integrating paired-end information is used to assign each read to its most likely location, on either the original reference allele or the alternative allele. Finally, we count the number of reads supporting the alternative allele (with insertions or deletions relative to the original reference allele) and the original allele, and fit a beta-binomial mixture model. Based on this model, the likelihood of each indel is calculated and the genotype is predicted. SPLindel runs at about the same speed as GATK, but much faster than DINDEL. It obtains very similar results to GATK and DINDEL for indels of size 1-15 bp, but is much more effective in detecting larger indels.
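The split-read idea can be shown with a deliberately simplified sketch: anchor the two halves of a read exactly in the reference, and read off a deletion from the gap between the anchors. Real callers handle mismatches, multiple hits, and use paired-end priors; everything here (names, exact-match anchoring) is an illustrative assumption.

```python
def split_read_deletion(ref, read):
    """Toy split-read check: if the read's two halves map with a gap
    between them, report (deletion_start, deletion_length)."""
    half = len(read) // 2
    left, right = read[:half], read[half:]
    i = ref.find(left)                      # anchor the left half
    if i < 0:
        return None
    j = ref.find(right, i + len(left))      # anchor the right half downstream
    if j < 0:
        return None
    gap = j - (i + len(left))               # bases of reference skipped
    return (i + len(left), gap) if gap > 0 else None

ref = "AAAACCCCGGGGTTTTACGTACGT"
# This read spans a 4-bp deletion of "GGGG" relative to the reference.
read = "CCCCTTTT"
assert split_read_deletion(ref, read) == (8, 4)
```

Because the evidence is the gap between the two anchors rather than a gapped alignment of the whole read, this style of detection scales naturally to deletions far longer than the read itself.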
Using machine learning and statistical modeling, we propose tools that solve these two important problems in sequencing data analysis. This study helps identify novel damaging nsSNPs more accurately and efficiently, and equips researchers with a more powerful tool for identifying indels, especially long deletions. As more and more sequencing data are generated, the methods and tools introduced in this thesis may help us extract useful information to facilitate the identification of mutations causal to human diseases. / published_or_final_version / Paediatrics and Adolescent Medicine / Doctoral / Doctor of Philosophy
|
8 |
A study on predicting gene relationship from a computational perspective
Chan, Pui-yee, 陳沛儀. January 2004 (has links)
published_or_final_version / abstract / toc / Computer Science and Information Systems / Master / Master of Philosophy
|
9 |
Approximate string alignment and its application to ESTs, mRNAs and genome mapping
Yim, Cheuk-hon, Terence, 嚴卓漢. January 2004 (has links)
published_or_final_version / abstract / Computer Science and Information Systems / Master / Master of Philosophy
|
10 |
Bioinformatics analyses for next-generation sequencing of plasma DNA
January 2012 (has links)
In 1997, Dennis Lo and colleagues demonstrated the presence of fetal DNA in the plasma of pregnant women, opening the door to noninvasive prenatal diagnosis. Initial applications included fetal sex determination and rhesus D blood group typing. With the emergence and development of next-generation sequencing, more sophisticated analyses and applications of cell-free plasma DNA have become possible; for example, at twelve weeks of gestation, next-generation sequencing of maternal plasma DNA can predict fetal trisomy 21 with an accuracy of 98%. The first part of this thesis describes how to construct a genome-wide genetic map of the fetus from maternal plasma DNA. This is very challenging because, at twelve weeks of gestation, the fetus contributes only a small fraction of the plasma DNA, typically around 10%, and most fetal DNA fragments in plasma are shorter than 200 bp; existing algorithms and programs are not suited to constructing a fetal genetic map from maternal plasma DNA. In this work, based on the genotypes of the mother and father, bioinformatics methods are first used to construct the possible genetic maps of the fetus, and the maternal plasma DNA sequencing data are then compared against these maps. Where the mother is homozygous, the paternally inherited fetal alleles are determined qualitatively, by testing whether paternal-specific fragments are present in the maternal plasma. Where the mother is heterozygous, determining the maternal inheritance requires quantitative analysis. I developed the relative haplotype dosage (RHDO) approach, which statistically compares the dosages of the mother's two haplotypes in her plasma: the haplotype showing a significant increase is the one most likely inherited by the fetus. RHDO uses the sequencing data more efficiently, reduces the effect of sequencing fluctuation, and is more stable and robust than single-locus analysis. / The presence of fetal DNA in the cell-free plasma of pregnant women was first described in 1997. The initial clinical applications of this phenomenon focused on the detection of paternally inherited traits such as sex and rhesus D blood group status. The development of massively parallel sequencing technologies has allowed more sophisticated analyses of circulating cell-free DNA in maternal plasma. For example, through the determination of the proportional representation of chromosome 21 sequences in maternal plasma, noninvasive prenatal diagnosis of fetal Down syndrome can be achieved with an accuracy of >98%.
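The "proportional representation" test mentioned above is commonly framed as a z-score against euploid reference samples. The sketch below uses made-up illustrative numbers (the reference mean, SD, and cutoff are assumptions, not values from this thesis).

```python
def chr21_zscore(chr21_frac, ref_mean, ref_sd):
    """z-score of a test sample's chromosome 21 read fraction against
    the mean/SD of that fraction in euploid reference pregnancies."""
    return (chr21_frac - ref_mean) / ref_sd

# Illustrative numbers only: euploid references ~1.30% chr21 reads, SD 0.02%.
z = chr21_zscore(0.0142, 0.0130, 0.0002)
assert z > 3   # a common style of cutoff for flagging trisomy 21
```

An overrepresentation of chr21 reads arises because the trisomic fetal fraction contributes an extra half-dose of chromosome 21 to the plasma DNA pool.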
In the first part of my thesis, I developed bioinformatics algorithms to construct a genome-wide fetal genetic map from the massively parallel sequencing data of a pregnant woman's plasma DNA sample. The construction of the fetal genetic map from maternal plasma sequencing data is very challenging because fetal DNA constitutes only approximately 10% of the maternal plasma DNA. Moreover, as the fetal DNA in maternal plasma exists as short fragments of less than 200 bp, existing bioinformatics techniques for genome construction are not applicable. For the construction of the genome-wide fetal genetic map, I used the genomes of the father and the mother as scaffolds and calculated the fractional fetal DNA concentration. First, I examined the paternal-specific sequences in maternal plasma to determine which portions of the father's genome had been passed on to the fetus. For the determination of maternal inheritance, I developed the relative haplotype dosage (RHDO) approach, based on the principle that the portion of the maternal genome inherited by the fetus is present at a slightly higher concentration in the maternal plasma. The use of haplotype information enhances the efficacy of using the sequencing data, so the maternal inheritance can be determined at a much lower sequencing depth than by examining individual loci in the genome. This algorithm makes it feasible to use genome-wide scanning to diagnose fetal genetic disorders prenatally in a noninvasive way. / With the emergence of targeted massively parallel sequencing, the sequencing cost per base is dropping dramatically.
Although the first part of the thesis developed a method to estimate the fractional fetal DNA concentration using parental genotype information, that method cannot deduce the concentration directly from sequencing data without prior knowledge of the genotypes. In the second part of this thesis, I propose a statistical mixture model based method, FetalQuant, which uses maximum likelihood to estimate the fractional fetal DNA concentration directly from targeted massively parallel sequencing of maternal plasma DNA. This method is superior to existing methods in that it obviates the need for genotype information without loss of accuracy. Furthermore, using Bayes' rule, the method can distinguish the informative SNPs at which the mother is homozygous and the fetus is heterozygous, with the potential to detect dominantly inherited disorders. / Besides genetic analysis at the DNA level, epigenetic markers are also valuable for the development of noninvasive diagnostics. In the third part of this thesis, I developed a bioinformatics algorithm to efficiently analyze genome-wide DNA methylation status based on massively parallel sequencing of bisulfite-converted DNA. DNA methylation is one of the most important mechanisms for regulating gene expression, and studying the DNA methylation of different genes is important for understanding different physiological and pathological processes. Currently, the most popular method for analyzing DNA methylation status is bisulfite sequencing. It relies on the fact that unmethylated cytosine residues are chemically converted to uracil by bisulfite treatment, whereas methylated cytosines remain unchanged; the converted uracil and unconverted cytosine can then be discriminated on sequencing.
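The bisulfite principle just described can be sketched in a few lines: an in-silico C-to-T conversion of the reference (the first step of tools in this family), plus a toy per-site methylation estimate. Function names and the toy reads are illustrative assumptions, not Methyl-Pipe's actual interface.

```python
def bisulfite_convert(seq):
    """In-silico bisulfite conversion of the Watson strand: every C
    becomes T, as unmethylated cytosine reads after PCR."""
    return seq.replace("C", "T")

def methylation_level(ref, aligned_reads, pos):
    """Fraction of reads still showing C (methylated) versus T
    (converted, unmethylated) at a reference cytosine position.
    aligned_reads are assumed already positioned against ref."""
    assert ref[pos] == "C"
    calls = [r[pos] for r in aligned_reads]
    return calls.count("C") / (calls.count("C") + calls.count("T"))

ref = "ACGTCCGT"
reads = ["ACGTCTGT", "ATGTTTGT", "ACGTCCGT"]   # toy bisulfite reads
assert bisulfite_convert("ACGT") == "ATGT"
assert abs(methylation_level(ref, reads, 4) - 2 / 3) < 1e-9
```

The same conversion must also be applied to the reads (and the Crick strand handled symmetrically) before alignment, which is why bisulfite mapping needs a dedicated pipeline.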
With the emergence of massively parallel sequencing platforms, it is possible to perform bisulfite sequencing analysis on a genome-wide scale. However, the bioinformatics analysis of genome-wide bisulfite sequencing data is much more complicated than analyzing data from individual loci. I have therefore developed Methyl-Pipe, a bioinformatics program for analyzing genome-wide DNA methylation status based on massively parallel sequencing. In the first step of this algorithm, an in-silico converted reference genome is produced by converting all cytosine residues to thymine residues. The sequenced reads of bisulfite-converted DNA are then aligned to this modified reference. Finally, post-processing of the alignments removes non-unique and low-quality mappings and characterizes the methylation pattern genome-wide. Using this new program, potential fetal-specific hypomethylated regions that can serve as blood biomarkers can be identified in a genome-wide manner. / Detailed summary in vernacular field only. / Jiang, Peiyong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 100-105). / Abstracts also in Chinese.
SECTION I: BACKGROUND
  Chapter 1: Circulating nucleic acids and next-generation sequencing
    1.1 Circulating nucleic acids
    1.2 Next-generation sequencing
    1.3 Bioinformatics analyses
    1.4 Applications of the NGS
    1.5 Aims of this thesis
SECTION II: Mathematically decoding the fetal genome in maternal plasma
  Chapter 2: Characterizing the maternal and fetal genome in plasma at single base resolution
    2.1 Introduction
    2.2 SNP categories and principle
    2.3 Clinical cases and SNP genotyping
    2.4 Sequencing depth and fractional fetal DNA concentration determination
    2.5 Filtering of genotyping errors for maternal genotypes
    2.6 Constructing fetal genetic map in maternal plasma
    2.7 Sequencing error estimation
    2.8 Paternal-inherited alleles
    2.9 Maternally-derived alleles by RHDO analysis
    2.10 Recombination breakpoint simulation and detection
    2.11 Prenatal diagnosis of β-thalassaemia
    2.12 Discussion
SECTION III: Statistical model for fractional fetal DNA concentration estimation
  Chapter 3: FetalQuant: deducing the fractional fetal DNA concentration from massively parallel sequencing of maternal plasma DNA
    3.1 Introduction
    3.2 Methods
      3.2.1 Maternal-fetal genotype combinations
      3.2.2 Binomial mixture model and likelihood
      3.2.3 Fractional fetal DNA concentration fitting
    3.3 Results
      3.3.1 Datasets
      3.3.2 Evaluation of FetalQuant algorithm
      3.3.3 Simulation
      3.3.4 Sequencing depth and the number of SNPs required by FetalQuant
    3.5 Discussion
SECTION IV: NGS-based data analysis pipeline development
  Chapter 4: Methyl-Pipe: Methyl-Seq bioinformatics analysis pipeline
    4.1 Introduction
    4.2 Methods
      4.2.1 Overview of Methyl-Pipe
    4.3 Results and discussion
SECTION V: CONCLUDING REMARKS
  Chapter 5: Conclusion and future perspectives
    5.1 Conclusion
    5.2 Future perspectives
Reference
|