711. Efficient methods for improving the sensitivity and accuracy of RNA alignments and structure prediction. Li, Yaoman, 李耀满. January 2013.
RNA plays an important role in molecular biology, and RNA sequence comparison is an important method for analyzing gene expression. Because aligning RNA reads must handle gaps, mutations, poly-A tails, etc., it is much more difficult than aligning other sequences. In this thesis, we study RNA-Seq alignment tools and existing gene information databases, and investigate how to improve alignment accuracy and predict RNA secondary structure.
The known gene information database contains a large amount of reliable gene information that has already been discovered. We also note that most DNA alignment tools are well developed: they run much faster than existing RNA-Seq alignment tools and achieve higher sensitivity and accuracy. Combining them with the known gene information database, we present a method to align RNA-Seq data using DNA alignment tools; that is, we use a DNA alignment tool to perform the alignment and then use the gene information to convert the alignment to genome coordinates.
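A minimal Python sketch of that coordinate-conversion step, using a hypothetical exon-block gene model (the function and data layout here are illustrative, not the thesis's actual implementation):

    # Map a transcript-relative alignment position back to a genome position
    # using the exon structure of a known gene (illustrative gene model).
    def transcript_to_genome(exons, t_pos):
        """exons: (genome_start, genome_end) blocks in transcript order,
        0-based and end-exclusive; t_pos: offset within the spliced transcript."""
        offset = t_pos
        for g_start, g_end in exons:
            exon_len = g_end - g_start
            if offset < exon_len:
                return g_start + offset          # position falls inside this exon
            offset -= exon_len                   # otherwise skip to the next exon
        raise ValueError("position beyond transcript length")

    # Example: a two-exon gene; transcript position 130 maps into the second exon.
    print(transcript_to_genome([(1000, 1100), (1500, 1600)], 130))   # 1530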
Although the gene information database is updated daily, many genes and alternative splicing events have not yet been discovered. If our RNA alignment tool relied only on the known gene database, many reads coming from unknown genes or alternative splicing events could not be aligned. We therefore present a combinatorial method that covers potential alternative splicing junction sites. Combined with the original gene database, the new alignment tool covers most of the alignments reported by other RNA-Seq alignment tools.
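One way to picture the junction-covering idea is to pair the tail of each known exon with the head of each downstream exon, so that reads spanning an unannotated junction still have a reference to align against. The sketch below is only an illustration of that idea; the flank length and exhaustive exon pairing are assumptions, not taken from the thesis:

    def junction_sequences(genome, exons, flank=30):
        """exons: sorted (start, end) genome intervals of known exons;
        returns candidate junction sequences keyed by the exon pair."""
        candidates = {}
        for i in range(len(exons)):
            for j in range(i + 1, len(exons)):
                donor = genome[max(exons[i][1] - flank, exons[i][0]):exons[i][1]]
                acceptor = genome[exons[j][0]:min(exons[j][0] + flank, exons[j][1])]
                candidates[(i, j)] = donor + acceptor
        return candidates

    genome = "ACGT" * 50
    print(len(junction_sequences(genome, [(0, 40), (80, 120), (160, 200)])))   # 3 exon pairs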
Many RNA-Seq alignment tools have been developed recently; they are more powerful and faster than the previous generation of tools. However, RNA read alignment is much more complicated than other sequence alignment, and the alignments reported by some RNA-Seq alignment tools have low accuracy. We present a simple and efficient filtering method based on the quality scores of the reads, which removes most low-accuracy alignments.
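A minimal sketch of such a quality-score filter, assuming Phred+33 encoded quality strings and an illustrative threshold (not the thesis's actual cutoff or rule):

    # Drop alignments whose reads have low average base-call quality, on the
    # premise that alignments of low-quality reads are the least reliable.
    def mean_phred(quality_string, offset=33):
        return sum(ord(c) - offset for c in quality_string) / len(quality_string)

    def keep_alignment(quality_string, threshold=20.0):
        return mean_phred(quality_string) >= threshold

    print(keep_alignment("IIIIIIIIII"))   # Phred 40 at every base -> True
    print(keep_alignment("####******"))   # Phred 2-9 -> False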
Finally, we present an RNA secondary structure prediction method that can predict pseudoknots (a type of RNA secondary structure) with high sensitivity and specificity. / published_or_final_version / Computer Science / Master / Master of Philosophy
712. Algorithms for evolving graph analysis. Ren, Chenghui, 任成會. January 2014.
In many applications, entities and their relationships are represented by graphs. Examples include social networks (users and friendship), the WWW (web pages and hyperlinks) and bibliographic networks (authors and co-authorship). In a dynamic world, information changes and so the graphs representing the information evolve with time. For example, a Facebook link between two friends is established, or a hyperlink is added to a web page. We propose that historical graph-structured data be archived for analytical processing. We call a historical evolving graph sequence an EGS.
We study the problem of efficient query processing on an EGS, which has many applications in evolving graph analysis. To solve the problem, we propose a solution framework called FVF and a cluster-based LU decomposition algorithm called CLUDE, both of which evaluate queries efficiently to support EGS analysis.
The Find-Verify-and-Fix (FVF) framework applies to a wide range of queries. We demonstrate how some important graph measures, including shortest-path distance, closeness centrality and graph centrality, can be efficiently computed from EGSs using FVF. Since an EGS generally contains numerous large graphs, we also discuss several compact storage models that support our FVF framework. Through extensive experiments on both real and synthetic datasets, we show that our FVF framework is highly efficient in EGS query processing.
A graph can be conveniently modeled by a matrix, from which various quantitative measures are derived, such as PageRank, SALSA, Personalized PageRank and Random Walk with Restart. To compute these measures, linear systems of the form Ax = b, where A is a matrix that captures a graph's structure, need to be solved. To facilitate solving the linear system, the matrix A is often decomposed into two triangular matrices (L and U). In a dynamic world, the graph changes with time, and so does the matrix A that represents it. We consider a sequence of evolving graphs and its associated sequence of evolving matrices. We study how LU decomposition should be done over the sequence so that (1) the decomposition is efficient and (2) the resulting LU matrices best preserve the sparsity of the matrices A (i.e., the number of extra non-zero entries introduced in L and U is minimized). We propose a cluster-based algorithm, CLUDE, for solving the problem. Through an experimental study, we show that CLUDE is about an order of magnitude faster than the traditional incremental update algorithm. The number of extra non-zero entries introduced by CLUDE is also about an order of magnitude smaller than that of the traditional algorithm. CLUDE is thus an efficient algorithm for LU decomposition that produces high-quality LU matrices over an evolving matrix sequence. / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
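The role LU decomposition plays here can be illustrated with a small Python sketch: a textbook Doolittle factorization (not CLUDE itself) applied to a toy graph-structured matrix, with a count of the extra non-zero entries, i.e. the fill-in that the algorithm tries to keep small:

    import numpy as np

    def lu_doolittle(A):
        """Plain LU factorization without pivoting (assumes the factorization exists)."""
        n = A.shape[0]
        L, U = np.eye(n), np.zeros((n, n))
        for i in range(n):
            for j in range(i, n):                    # fill row i of U
                U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
            for j in range(i + 1, n):                # fill column i of L
                L[j, i] = (A[j, i] - L[j, :i] @ U[:i, i]) / U[i, i]
        return L, U

    A = np.array([[4., 1., 0., 1.],
                  [1., 3., 1., 0.],
                  [0., 1., 2., 1.],
                  [1., 0., 1., 3.]])
    L, U = lu_doolittle(A)
    b = np.array([1., 0., 0., 0.])
    x = np.linalg.solve(U, np.linalg.solve(L, b))    # solve L y = b, then U x = y
    fill_in = np.count_nonzero(np.tril(L, -1)) + np.count_nonzero(U) - np.count_nonzero(A)
    print(np.allclose(A @ x, b), fill_in)            # True, with 2 extra non-zero entries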
713. Using semantic sub-scenes to facilitate scene categorization and understanding. Zhu, Shanshan, 朱珊珊. January 2014.
This thesis proposes to learn the cognitive element absent from conventional scene categorization methods, sub-scenes, and to use them to better categorize and understand scenes. In scene categorization, it has been observed that ambiguity occurs when the scene is treated as a whole. Scene ambiguity arises when a similar set of sub-scenes is arranged differently to compose different scenes, or when a scene literally contains several categories. These ambiguities can, however, be resolved with knowledge of sub-scenes. Thus, it is worthwhile to study sub-scenes and use them to better understand a scene.
The proposed research first considers an unsupervised method to segment sub-scenes. It emphasizes generating more integral regions rather than the over-segmented regions usually produced by conventional segmentation methods. Several properties of sub-scenes are explored, such as proximity grouping, area of influence, similarity and harmony, based on psychological principles. These properties are formulated into constraints that are used directly in the proposed framework. A self-determined approach is employed to produce the final segmentation result based on the characteristics of each image in an unsupervised manner. The proposed method performs competitively against other state-of-the-art unsupervised segmentation methods, with an F-measure of 0.55, a covering score of 0.51 and a VoI of 1.93 on the Berkeley segmentation dataset. On the Stanford background dataset, it achieves an overlapping score of 0.566, higher than the 0.499 of the comparison method.
To segment and label sub-scenes simultaneously, a supervised approach to semantic segmentation is proposed, developed on a Hierarchical Conditional Random Field classification framework. The proposed method integrates contextual information into the model to improve classification performance. Two kinds of contextual information are considered: global consistency, which generalizes the scene by scene type, and spatial context, which takes spatial relationships into account. The proposed method improves semantic segmentation by favoring more logical class combinations. It achieves the best score on the MSRC-21 dataset, with global accuracy of 87% and average accuracy of 81%, outperforming all other state-of-the-art methods by 4% on each measure. On the Stanford background dataset, it achieves global accuracy of 80.5% and average accuracy of 71.8%, again outperforming other methods by 2%.
Finally, the proposed research incorporates sub-scenes into the scene categorization framework to improve categorization performance, especially in ambiguous cases. The proposed method encodes sub-scenes so that their spatial information is also considered. The sub-scene descriptor complements the global descriptor of a scene by evaluating local features with specific geometric attributes. The proposed method obtains an average categorization accuracy of 92.26% on the 8 Scene Category dataset, outperforming all other published methods by over 2%. It handles ambiguous cases more accurately by discerning which part exemplifies a scene category and how those categories are organized. / published_or_final_version / Electrical and Electronic Engineering / Doctoral / Doctor of Philosophy
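As a loose illustration of how a sub-scene descriptor might complement a global one, the sketch below pools per-region histograms into a coarse spatial grid, weighted by region area, and concatenates the result with the global histogram. The grid size, weighting and geometric attributes are assumptions made for illustration only, not the descriptor actually proposed in the thesis:

    import numpy as np

    def scene_descriptor(global_hist, subscenes, grid=2):
        """subscenes: list of (local_hist, area_fraction, (cx, cy)),
        with region centroids normalized to [0, 1] x [0, 1]."""
        dim = len(global_hist)
        cells = np.zeros((grid, grid, dim))
        for hist, area, (cx, cy) in subscenes:
            r = min(int(cy * grid), grid - 1)        # grid cell of the region centroid
            c = min(int(cx * grid), grid - 1)
            cells[r, c] += area * np.asarray(hist, dtype=float)   # area-weighted pooling
        return np.concatenate([np.asarray(global_hist, dtype=float), cells.ravel()])

    desc = scene_descriptor([0.5, 0.5],
                            [([1.0, 0.0], 0.3, (0.2, 0.8)),
                             ([0.0, 1.0], 0.7, (0.9, 0.1))])
    print(desc.shape)   # (10,): 2 global dimensions + 2 x 2 grid x 2 dimensions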
714. Competitive online job scheduling algorithms under different energy management models. Chan, Sze-hang, 陳思行. January 2013.
Online flow-time scheduling is a fundamental problem in computer science and has been studied extensively for years. It asks how to design a scheduler that serves computer jobs with unpredictable arrival times and varying sizes and priorities so as to minimize the total flow time (better understood as response time) of the jobs. It has many applications, most notably in the operation of server farms. As energy has become an important issue, the design of a scheduler also has to take power management into consideration, for example, how to scale the speed of the processors dynamically. The objectives conflict: one would prefer a lower processor speed to save energy, yet a good quality of service must be retained. In this thesis, I study a few scheduling problems for energy and flow time in depth and give new algorithms to tackle them. The competitiveness of our algorithms is guaranteed by worst-case mathematical analysis against the best possible or hypothetical solutions.
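A toy Python sketch of the trade-off: running jobs at a fixed speed s under the standard power function s^alpha. The fixed speed, alpha = 3 and FIFO order are simplifying assumptions for illustration; the thesis's algorithms vary the speed over time:

    def flow_time_and_energy(jobs, speed, alpha=3):
        """jobs: list of (release_time, size); one processor runs them FIFO
        at a fixed speed; a job of size w takes w/speed time to finish."""
        t, total_flow, energy = 0.0, 0.0, 0.0
        for release, size in sorted(jobs):
            t = max(t, release) + size / speed         # completion time of this job
            total_flow += t - release                  # flow time = completion - release
            energy += (size / speed) * speed ** alpha  # power speed^alpha over the run time
        return total_flow, energy

    for s in (1.0, 2.0, 4.0):                          # faster -> less flow time, more energy
        print(s, flow_time_and_energy([(0, 1), (0, 1), (1, 2)], s))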
In the speed scaling model, the power of a processor increases with its speed according to a certain function (e.g., a cubic function of speed). Among all online scheduling problems with speed scaling, the nonclairvoyant setting (in which the size of a job is not known during its execution) with arbitrary priorities is perhaps the most challenging. This thesis gives the first competitive algorithm, called WLAPS, for this setting.
In reality, it is not uncommon that during peak-load periods some (low-priority) users have their jobs rejected by the servers. This motivated me to study more complicated scheduling algorithms that strike a good balance among speed scaling, flow time and rejection penalty. Two new algorithms, UPUW and HDFAC, for different models of rejection penalty have been proposed and analyzed.
Last, and perhaps most interesting, we study power management in a large server-farm environment in which the primary energy-saving mechanism is to put some processors to sleep. Two new algorithms, POOL and SATA, have been designed to handle jobs that cannot and that can migrate among the processors, respectively. They are integrated algorithms that consider speed scaling, job scheduling and processor sleep management together to optimize energy usage and flow time simultaneously. These algorithms are again proven mathematically to be competitive even in the worst case. / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
715. Workflows for identifying differentially expressed small RNAs and detection of low copy repeats in human. Liu, Xuan, 刘璇. January 2014.
With the rapid development of next-generation sequencing (NGS) technology, we are able to investigate various biological problems, including genome and transcriptome sequencing, genomic structural variation and the mechanisms of regulatory small RNAs, at a low cost in both money and time. An enormous number of computational methods have been proposed to study these biological problems using NGS reads. Regulatory small RNAs and genomic structural variations are the two main problems that we have studied.
In the area of regulatory small RNAs, various computational tools have been designed, ranging from small RNA prediction to target prediction. Regulatory small RNAs play essential roles in plants and bacteria, such as in responses to environmental stresses. We focus on sRNAs that act by base pairing with a target mRNA through complementarity. A comprehensive analysis workflow that integrates sRNA-Seq and RNA-Seq analysis and generates a regulatory network had not yet been designed. Thus, we proposed and implemented two small RNA analysis workflows, for plants and bacteria respectively.
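The base-pairing step at the heart of such workflows can be sketched in a few lines of Python: scan an mRNA for the region complementary to a small RNA, allowing a few mismatches. The mismatch allowance and scoring below are illustrative assumptions, not the workflows' actual target-prediction rules:

    COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

    def reverse_complement(rna):
        return "".join(COMPLEMENT[b] for b in reversed(rna))

    def best_pairing_site(srna, mrna, max_mismatches=3):
        """Return (start, mismatches) of the best complementary site, or None."""
        probe = reverse_complement(srna)
        best = None
        for i in range(len(mrna) - len(probe) + 1):
            mm = sum(1 for a, b in zip(probe, mrna[i:i + len(probe)]) if a != b)
            if mm <= max_mismatches and (best is None or mm < best[1]):
                best = (i, mm)
        return best

    print(best_pairing_site("AGGUUC", "CCCGAACCUCCC"))   # (3, 0): perfect pairing at position 3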
In the area of genomic structural variations (SVs), two types of disease-related SVs have been investigated: complex low copy repeats (LCRs, also termed segmental duplications) and tandem duplications (TDs). LCRs provide the structural basis for combinations of other SVs that may in turn lead to serious genetic diseases, and TDs of specific regions have been reported in patients. Locating LCRs and TDs in the human genome can help researchers further interrogate the mechanisms of the related diseases. Therefore, we proposed two computational methods to predict novel LCRs and TDs in the human genome. / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
716. Binning and annotation for metagenomic next-generation sequencing reads. Wang, Yi, 王毅. January 2014.
The development of next-generation sequencing technology enables us to obtain a vast number of short reads from metagenomic samples, in which reads from different species are mixed together. Metagenomic binning has therefore been introduced to cluster reads from the same or closely related species, and metagenomic annotation to predict the taxonomic information of each read. Both are critical steps for downstream analysis. This thesis discusses the difficulties of these two computational problems and proposes two algorithmic methods, MetaCluster 5.0 and MetaAnnotator, as solutions.
There are six major challenges in metagenomic binning: (1) the lack of reference genomes; (2) uneven abundance ratios; (3) short read lengths; (4) a large number of species; (5) the existence of species with extremely low abundance; and (6) recovering low-abundance species. To address these challenges, I propose a two-round binning method, MetaCluster 5.0. Its improvements are based on three major observations. First, the short q-mer (length-q substring, with q = 4, 5) frequency distributions of sufficiently long fragments sampled from the same genome are more similar than those of fragments sampled from different genomes. Second, sufficiently long w-mers (length-w substrings, with w ≈ 30) are usually unique within each individual genome. Third, the k-mer (length-k substring, with k ≈ 16) frequencies from the reads of a species are usually linearly proportional to the species' abundance.
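The first observation can be made concrete with a short Python sketch that computes normalized 4-mer frequency profiles and compares them with a Euclidean distance (the distance measure and toy fragments are illustrative, not MetaCluster's actual procedure):

    import itertools
    from math import sqrt

    def qmer_profile(seq, q=4):
        """Normalized q-mer frequency vector of a fragment."""
        keys = ["".join(p) for p in itertools.product("ACGT", repeat=q)]
        counts = dict.fromkeys(keys, 0)
        for i in range(len(seq) - q + 1):
            kmer = seq[i:i + q]
            if kmer in counts:                 # skips q-mers containing N
                counts[kmer] += 1
        total = max(sum(counts.values()), 1)
        return [counts[k] / total for k in keys]

    def distance(p1, p2):
        return sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

    frag_a = "ACGTACGTGGCA" * 20               # two fragments from the same toy "genome"
    frag_b = "TTGGCCAATTGG" * 20               # a fragment from a different toy "genome"
    print(distance(qmer_profile(frag_a), qmer_profile(frag_a[5:])),
          distance(qmer_profile(frag_a), qmer_profile(frag_b)))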
The metagenomic annotation methods in the literature often suffer from five major drawbacks: (1) inability to annotate many reads; (2) less precise annotation for reads and more incorrect annotation for contigs; (3) inability to deal well with novel clades that have few reference genomes; (4) performance affected by variable genome sequence similarity between different clades; and (5) high time complexity. In this thesis, a novel tool, MetaAnnotator, is proposed to tackle these problems. MetaAnnotator makes four major contributions. First, instead of annotating reads and contigs independently, a cluster of reads/contigs is annotated as a whole. Second, multiple reference databases are integrated. Third, for each individual clade, quadratic discriminant analysis is applied to capture the similarities between reference sequences in the clade. Fourth, instead of using alignment tools, MetaAnnotator performs annotation using exact k-mer matching, which is more efficient.
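A minimal sketch of the exact k-mer matching idea, assuming a simple voting rule over a cluster of reads (the index layout and voting are illustrative simplifications, not MetaAnnotator's actual annotation procedure):

    from collections import Counter, defaultdict

    def build_kmer_index(references, k=16):
        """references: dict mapping clade name -> reference sequence."""
        index = defaultdict(set)
        for clade, seq in references.items():
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].add(clade)
        return index

    def annotate_cluster(reads, index, k=16):
        """Annotate a whole read cluster by majority vote over exact k-mer hits."""
        votes = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                for clade in index.get(read[i:i + k], ()):
                    votes[clade] += 1
        return votes.most_common(1)[0][0] if votes else None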
Experiments on both simulated and real datasets show that MetaCluster 5.0 and MetaAnnotator outperform existing tools, achieving higher accuracy at a lower cost in time and space. / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
717. Practical Delaunay triangulation algorithms for surface reconstruction and related problems. Choi, Sunghee. 28 August 2008.
Not available / text
718. Maximum likelihood techniques for joint segmentation-classification of multi-spectral chromosome images. Schwartzkopf, Wade Carl. 28 August 2008.
Not available / text
719. Integration of hard real-time schedulers. Wang, Weirong. 28 August 2008.
Not available / text
720. Accounting for uncertainty, robustness and online information in transportation networks. Ukkusuri, Satish V. S. K. 28 August 2008.
Not available / text