Global ETD Search

51	Efficient methods for improving the sensitivity and accuracy of RNA alignments and structure prediction Li, Yaoman, 李耀满 January 2013 (has links) RNA plays an important role in molecular biology, and RNA sequence comparison is an important method for analyzing gene expression. Because aligning RNA reads requires handling gaps, mutations, poly-A tails, etc., it is much more difficult than aligning other sequences. In this thesis, we study RNA-Seq alignment tools and the existing gene information database, and how to improve the accuracy of alignment and predict RNA secondary structure. The known gene information database contains a large amount of reliable gene information that has already been discovered. We also note that most DNA alignment tools are well developed: they run much faster than existing RNA-Seq alignment tools and have higher sensitivity and accuracy. Combining them with the known gene information database, we present a method to align RNA-Seq data using DNA alignment tools; that is, we use the DNA alignment tools to perform the alignment and use the gene information to convert the result to a genome-based alignment. Although the gene information database is updated daily, many genes and alternative splicings have not yet been discovered. If our RNA alignment tool relied only on the known gene database, many reads that come from unknown genes or alternative splicings could not be aligned. Thus, we present a combinational method that can cover potential alternative splicing junction sites. Combined with the original gene database, the new alignment tool can cover most alignments reported by other RNA-Seq alignment tools. Recently, many RNA-Seq alignment tools have been developed. They are more powerful and faster than the older generation of tools. However, RNA read alignment is much more complicated than other sequence alignment, and the alignments reported by some RNA-Seq alignment tools have low accuracy. We present a simple and efficient filter method based on the quality score of the reads. 
It can filter out most low-accuracy alignments. Finally, we present an RNA secondary structure prediction method that can predict pseudoknots (a type of RNA secondary structure) with high sensitivity and specificity. / published_or_final_version / Computer Science / Master / Master of Philosophy Nucleotide sequence - Data processing
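The abstract above does not say how the quality-score filter is defined. As a rough illustration only, the sketch below assumes Phred+33 encoded quality strings and drops alignments whose mean base quality falls below a cutoff. The record layout, the function names and the threshold of 20 are hypothetical and are not taken from the thesis.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ-style quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality, offset=33):
    """Mean Phred score of a read; 0.0 for an empty quality string."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


def filter_alignments(alignments, min_mean_quality=20.0):
    """Keep alignments whose read has a mean quality at or above the cutoff.

    Each alignment is a dict with a 'quality' string (hypothetical layout).
    """
    return [a for a in alignments if mean_quality(a["quality"]) >= min_mean_quality]


alignments = [
    {"read_id": "r1", "quality": "IIII"},  # 'I' decodes to Phred 40
    {"read_id": "r2", "quality": "####"},  # '#' decodes to Phred 2
]
kept = filter_alignments(alignments)
```

A mean is only one possible summary; a filter of this kind could equally use the minimum quality or the quality at mismatched positions.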
52	A fast and accurate model to detect germline SNPs and somatic SNVs with high-throughput sequencing Wang, Weixin, 王煒欣 January 2014 (has links) The rapid development of high-throughput sequencing technology provides a new opportunity to extend the scale and resolution of genomic research. How to efficiently and accurately call genetic variants at the single-base level (germline single nucleotide polymorphisms (SNPs) or somatic single nucleotide variants (SNVs)) is the fundamental challenge in sequencing data analysis, because these variants are reported to influence transcriptional regulation, alternative splicing, non-coding RNA regulation and protein coding. Many applications have been developed to tackle this challenge. However, shallow depth and cellular heterogeneity prevent those tools from attaining satisfactory accuracy, and the huge volume of sequencing data itself makes the process inefficient. In this dissertation, the performance of prevalent read aligners and SNP callers for second-generation sequencing (SGS) is first evaluated. The significantly lower coverage and poorer SNP calling performance of SGS in the regulatory regions of the human genome, which are due to high GC content, are then investigated. To enhance the capability to call SNPs, especially within lower-depth regions, a fast and accurate SNP detection (FaSD) program that uses a binomial-distribution-based algorithm and a mutation probability is proposed. Based on comparison with popular software and benchmarking against SNP arrays and high-depth sequencing data, it is demonstrated that FaSD has the best SNP calling accuracy in terms of genotype concordance rate and AUC. Furthermore, FaSD can finish SNP calling within four hours for 10X human genome SGS data on a standard desktop computer. Lastly, combined with joint genotype likelihoods, an updated version of FaSD is proposed to call cancerous somatic SNVs between paired tumor and normal samples. 
Extensive assessments on various types of cancer demonstrate that, whether benchmarked against known somatic SNVs and germline SNPs from databases or against somatic SNVs called from higher-depth data, FaSD-somatic has the best overall performance. Inheriting from and improving on FaSD, FaSD-somatic is also the fastest somatic SNV caller among current programs, and can finish calling somatic mutations within 14 hours for 50X paired tumor and normal samples on an ordinary server. / published_or_final_version / Biochemistry / Doctoral / Doctor of Philosophy Chromosome polymorphism Nucleotide sequence
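To give a feel for what a binomial-distribution-based genotype call can look like, here is a minimal sketch. It is not FaSD's actual model: the three genotype models, the fixed sequencing error rate of 0.01 and the function names are assumptions made for illustration, and the mutation probability mentioned in the abstract is left out.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def call_genotype(alt_reads, depth, error_rate=0.01):
    """Pick the genotype whose binomial model best explains the alt-read count.

    Assumed models: a read shows the alternative allele with probability
    error_rate (homozygous reference), 0.5 (heterozygous) or
    1 - error_rate (homozygous alternative).
    """
    models = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1 - error_rate}
    likelihoods = {g: binom_pmf(alt_reads, depth, p) for g, p in models.items()}
    best = max(likelihoods, key=likelihoods.get)
    return best, likelihoods


genotype, likelihoods = call_genotype(alt_reads=5, depth=10)
```

With equal priors this is a plain maximum-likelihood call; a prior such as a per-site mutation probability would be multiplied into each likelihood before taking the maximum.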
53	Aligning multiple sequences adaptively Ye, Yongtao, 叶永滔 January 2014 (has links) With the rapid development of genome sequencing, an ever-increasing number of molecular biology analyses, such as motif detection, phylogeny inference and structure prediction, rely on the construction of an accurate multiple sequence alignment (MSA). Although many methods have been developed during the last two decades, most of them may perform poorly on some types of inputs, in particular when families of sequences fall below thirty percent similarity. Therefore, this thesis introduces two different effective approaches to improve the overall quality of multiple sequence alignment. First, by considering the similarity of the input sequences, we propose an adaptive approach that computes better substitution matrices for each pair of sequences and then applies the progressive alignment method to align them. For example, for inputs with high similarity, we consider the whole sequences and align them with a global pair-Hidden Markov model, while for those with moderately low similarity, we may ignore the flank regions and use local pair-Hidden Markov models to align them. To test the effectiveness of this approach, we have implemented a multiple sequence alignment tool called GLProbs and compared its performance with a dozen leading tools on three benchmark alignment databases; GLProbs' alignments have the best scores in almost all tests. We have also evaluated the practicability of GLProbs' alignments by applying the tool to three biological applications, namely phylogenetic tree reconstruction, protein secondary structure prediction and the detection of high-risk members for cervical cancer in the HPV-E6 family, and the results are very encouraging. Second, based on our previous study, we propose another new tool, PnpProbs, which constructs better multiple sequence alignments through better handling of guide trees. 
It classifies input sequences into two types: normally related sequences and distantly related sequences. For normally related sequences, it uses an adaptive approach to construct the guide tree and, based on this guide tree, aligns the sequences progressively. More precisely, it first estimates the input's discrepancy by computing the standard deviation of the sequences' percent identities, and based on this estimate, it chooses the best method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree; instead it uses the non-progressive sequence annealing method to construct the multiple sequence alignment. By combining the strengths of the progressive and non-progressive methods, and with a better way to construct the guide tree, PnpProbs significantly improves the quality of multiple sequence alignments, not only for general input sequences but also for those that are very distantly related. With these encouraging empirical results, our software tools have gradually gained recognition in the community. For example, GLProbs has been incorporated, by invitation, into the JAva Bioinformatics Analysis Web Services system (JABAWS). / published_or_final_version / Computer Science / Master / Master of Philosophy Sequence alignment (Bioinformatics)
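The discrepancy estimate described for PnpProbs (the standard deviation of percent identities) can be sketched as follows. This is an illustration under assumptions, not the tool's code: sequences are taken to be already aligned to equal length with `-` as the gap symbol, identity is computed over gap-free columns, and the population standard deviation is used. The abstract does not specify any of these details.

```python
from itertools import combinations
from statistics import pstdev


def percent_identity(a, b):
    """Percent identity of two pre-aligned sequences, ignoring gapped columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def identity_spread(sequences):
    """Standard deviation of all pairwise percent identities."""
    identities = [percent_identity(a, b) for a, b in combinations(sequences, 2)]
    return pstdev(identities) if identities else 0.0


spread = identity_spread(["AAAA", "AAAA", "TTTT"])
```

A low spread suggests the inputs are uniformly related, while a high spread indicates a mix of close and distant pairs; a cutoff on this value would then select the guide-tree method.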
54	Binning and annotation for metagenomic next-generation sequencing reads Wang, Yi, 王毅 January 2014 (has links) The development of next-generation sequencing technology enables us to obtain a vast number of short reads from metagenomic samples. In metagenomic samples, the reads from different species are mixed together. Metagenomic binning has therefore been introduced to cluster reads from the same or closely related species, and metagenomic annotation to predict the taxonomic information of each read. Both metagenomic binning and annotation are critical steps in downstream analysis. This thesis discusses the difficulties of these two computational problems and proposes two algorithmic methods, MetaCluster 5.0 and MetaAnnotator, as solutions. There are six major challenges in metagenomic binning: (1) the lack of reference genomes; (2) uneven abundance ratios; (3) short read lengths; (4) a large number of species; (5) the existence of species with extremely low abundance; and (6) recovering low-abundance species. To solve these problems, I propose a two-round binning method, MetaCluster 5.0. The improvement achieved by MetaCluster 5.0 is based on three major observations. First, the short q-mer (length-q substring of the sequence with q = 4, 5) frequency distributions of individual sufficiently long fragments sampled from the same genome are more similar than those sampled from different genomes. Second, sufficiently long w-mers (length-w substring of the sequence with w ≈ 30) are usually unique in each individual genome. Third, the k-mer (length-k substring of the sequence with k ≈ 16) frequencies from reads of a species are usually linearly proportional to the species' abundance. 
The metagenomic annotation methods in the literature often suffer from five major drawbacks: (1) inability to annotate many reads; (2) less precise annotation for reads and more incorrect annotation for contigs; (3) inability to deal well with novel clades that have limited reference genomes; (4) performance affected by variable genome sequence similarities between different clades; and (5) high time complexity. In this thesis, a novel tool, MetaAnnotator, is proposed to tackle these problems. There are four major contributions of MetaAnnotator. Firstly, instead of annotating reads/contigs independently, a cluster of reads/contigs is annotated as a whole. Secondly, multiple reference databases are integrated. Thirdly, for each individual clade, quadratic discriminant analysis is applied to capture the similarities between reference sequences in the clade. Fourthly, instead of using alignment tools, MetaAnnotator performs annotation using k-mer exact matching, which is more efficient. Experiments on both simulated datasets and real datasets show that MetaCluster 5.0 and MetaAnnotator outperform existing tools with higher accuracy as well as lower time and space cost. / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy Nucleotide sequence - Data processing
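The first observation behind MetaCluster 5.0, that short q-mer frequency distributions are more similar within a genome than between genomes, can be illustrated with a small sketch. The choice of Manhattan distance and the function names are assumptions; the abstract does not state which similarity measure the tool uses.

```python
from collections import Counter


def qmer_frequencies(sequence, q=4):
    """Relative frequency of every length-q substring of the sequence."""
    counts = Counter(sequence[i:i + q] for i in range(len(sequence) - q + 1))
    total = sum(counts.values())
    return {mer: n / total for mer, n in counts.items()} if total else {}


def manhattan_distance(freq_a, freq_b):
    """Sum of absolute frequency differences over all observed q-mers."""
    mers = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(m, 0.0) - freq_b.get(m, 0.0)) for m in mers)


fragment_a = qmer_frequencies("ACGTACGTACGT")
fragment_b = qmer_frequencies("ACGTACGTACGA")
distance = manhattan_distance(fragment_a, fragment_b)
```

Fragments with a small distance would be candidates for the same bin. Real fragments need to be long enough for the frequency estimates to be stable, which is why the abstract speaks of sufficiently long fragments.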
55	Granulicatella, abiotrophia, and gemella bacteremia characterized by 16S ribosomal RNA gene sequencing 招紹裘, Chiu, Siu-kau. January 2002 (has links) published_or_final_version / Medical Sciences / Master / Master of Medical Sciences Bacterial genetics. Nucleotide sequence.
56	Further development of the visual genome explorer: a visual genomic comparative tool 鄭啓航, Cheng, Kai-hong. January 2001 (has links) published_or_final_version / Zoology / Master / Master of Philosophy Nucleotide sequence - Computer programs.
57	Exploiting high throughput DNA sequencing data for genomic analysis Fritz, Markus Hsi-Yang January 2012 (has links) No description available. 570.285 Nucleotide sequence ; Genomics
58	Problems in classical banach spaces Patterson, Wanda Ethel Diane McNair 12 1900 (has links) No description available. Banach spaces Sequence spaces
59	Design and Analysis of Cryptographic Pseudorandom Number/Sequence Generators with Applications in RFID Mandal, Kalikinkar 15 August 2013 (has links) This thesis is concerned with the design and analysis of strong de Bruijn sequences and span n sequences, and nonlinear feedback shift register (NLFSR) based pseudorandom number generators for radio frequency identification (RFID) tags. We study the generation of span n sequences using structured searching in which an NLFSR with a class of feedback functions is employed to find span n sequences. Some properties of the recurrence relation for the structured search are discovered. We use five classes of functions in this structured search, and present the number of span n sequences for 6 <= n <= 20. The linear span of a new span n sequence lies between near-optimal and optimal. According to our empirical studies, a span n sequence can be found in the structured search with a better probability of success. Newly found span n sequences can be used in the composited construction and in designing lightweight pseudorandom number generators. We first refine the composited construction based on a span n sequence for generating long de Bruijn sequences. A de Bruijn sequence produced by the composited construction is referred to as a composited de Bruijn sequence. The linear complexity of a composited de Bruijn sequence is determined. We analyze the feedback function of the composited construction from an approximation point of view for producing strong de Bruijn sequences. The cycle structure of an approximated feedback function and the linear complexity of a sequence produced by an approximated feedback function are determined. A few examples of strong de Bruijn sequences with the implementation issues of the feedback functions of an (n+16)-stage NLFSR are presented. We propose a new lightweight pseudorandom number generator family, named Warbler family based on NLFSRs for smart devices. 
Warbler family is comprised of a combination of modified de Bruijn blocks (CMDB) and a nonlinear feedback Welch-Gong (WG) generator. We derive the randomness properties such as period and linear complexity of an output sequence produced by the Warbler family. Two instances, Warbler-I and Warbler-II, of the Warbler family are proposed for passive RFID tags. The CMDBs of both Warbler-I and Warbler-II contain span n sequences that are produced by the structured search. We analyze the security properties of Warbler-I and Warbler-II by considering the statistical tests and several cryptanalytic attacks. Hardware implementations of both instances in VHDL show that Warbler-I and Warbler-II require 46 slices and 58 slices, respectively. Warbler-I can be used to generate 16-bit random numbers in the tag identification protocol of the EPC Class 1 Generation 2 standard, and Warbler-II can be employed as a random number generator in the tag identification as well as an authentication protocol for RFID systems. Cryptography Sequence RFID Security De Bruijn Sequence nonlinear feedback shift register Pseudorandom sequence Span n sequence
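For readers unfamiliar with the terminology in the abstract above: a span n sequence has period 2^n - 1 and contains every nonzero n-tuple exactly once per period, while a de Bruijn sequence has period 2^n and contains every n-tuple. The sketch below checks both properties for a 3-stage feedback shift register. The feedback functions are textbook examples chosen for illustration (a primitive linear recurrence, and the same recurrence with an added product term that makes it nonlinear); they are not the feedback functions or the composited construction studied in the thesis.

```python
def cycle_from(feedback, start):
    """States visited from `start` until it recurs, or None if it never does."""
    n = len(start)
    start = tuple(start)
    state, visited = start, []
    for _ in range(2 ** n):
        visited.append(state)
        # Shift left by one position and append the feedback bit.
        state = state[1:] + (feedback(state) & 1,)
        if state == start:
            return visited
    return None


def is_span_n(feedback, n):
    """True if the register cycles through all 2**n - 1 nonzero states."""
    cycle = cycle_from(feedback, (0,) * (n - 1) + (1,))
    return cycle is not None and len(cycle) == 2 ** n - 1 and (0,) * n not in cycle


def is_de_bruijn(feedback, n):
    """True if the register cycles through all 2**n states."""
    cycle = cycle_from(feedback, (0,) * (n - 1) + (1,))
    return cycle is not None and len(cycle) == 2 ** n


def linear_feedback(state):
    # Recurrence s[t+3] = s[t] XOR s[t+1] (primitive polynomial x^3 + x + 1).
    return state[0] ^ state[1]


def nonlinear_feedback(state):
    # Same recurrence plus a term that is 1 when the last two stages are 0,
    # which splices the all-zero state into the cycle.
    return state[0] ^ state[1] ^ int(state[1] == 0 and state[2] == 0)


output_bits = [s[0] for s in cycle_from(linear_feedback, (0, 0, 1))]
```

The search described in the abstract would, in spirit, run a check like `is_span_n` over a class of nonlinear feedback functions for much larger n, where exhaustive enumeration of states as done here becomes expensive.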
60	Identification and annotation of full-length genes in Atlantic salmon (Salmo salar) Leong, Jong S. 18 October 2011 (has links) Large-scale expressed sequence tags (ESTs) in Atlantic salmon (Salmo salar) are examined to answer questions regarding salmonid transcriptomes. ESTs represent raw and incomplete gene sequences that need to be read, assembled and analyzed with computer software. The goal of this thesis was to develop an automatically curated and publicly accessible set of annotated full-length genes, representing a near-complete transcript set for Salmo salar. In turn, these genes provide the framework for studies in gene expression, conservation, and molecular evolution. The work presented here also touches on the results of a molecular evolution study, as an example of how full-length gene identification can be used to answer biological questions. Previous to this study, a limited number of Atlantic salmon cDNA libraries and ESTs were known. To further the goal of determining complete gene sequences, highly enriched full-length cDNA libraries and full-length libraries were created and sequenced, resulting in the ability to identify a large number of full-length reference genes. Together, all libraries represent a diverse pool of transcriptome sequences for Salmo salar. The goal of producing an accurate large-scale full-length gene set on a duplicated genome is not trivial. Complete systems for this objective do not readily exist. EST sequencing, EST assembly, and data storage, are just a few of the initial computational issues that are addressed. Once these issues are resolved, the multi-step workflow of full-length gene determination is described. The final challenge involving the development of a concise and universally accessible system for visualization is discussed. The resulting computational framework that has been developed is shown to be able to handle the intricacies and the size of a duplicated salmonid genome. 
It has been largely accepted that Atlantic salmon have undergone a recent genome duplication. Gene paralogs provide one source of evidence for this event. Analysis of paralogs revealed signatures of asymmetric evolution possibly due to relaxation of selective pressure. This thesis provides a complete Bioinformatics analysis pipeline to analyze and to visualize a set of full-length reference genes for Atlantic salmon. Using full-length genes as a framework, the topic of molecular evolution was addressed to show evidence of asymmetrical evolution among gene duplicates. The full-length reference genes, along with ESTs and all putative transcripts, have been made publicly available. These results serve as a valuable genomic resource for next-generation sequencing and for all other salmonid research endeavours. / Graduate sequence tags gene cDNA
