1 |
Analysis and Abstraction of Parallel Sequence Search
Goddard, Christopher Joseph 03 October 2007
The ability to compare two biological sequences is extremely valuable, as matches can suggest evolutionary origins of genes or the purposes of particular amino acids. Results of such comparisons can be used in the creation of drugs, can help combat newly discovered viruses, or can assist in treating diseases.
Unfortunately, the rate of sequence acquisition is outpacing our ability to compute on these data. Further, traditional dynamic programming algorithms are too slow to meet the needs of biologists, who wish to compare millions of sequences daily. While heuristic algorithms improve upon the performance of these dated applications, they still cannot keep up with the steadily expanding search space.
Parallel sequence search implementations were developed to address this issue. By partitioning databases into work units for distributed computation, applications like mpiBLAST are able to achieve super-linear speedup over their sequential counterparts. However, such implementations are limited to clusters and require significant effort to work in a grid environment. Further, their parallelization strategies are typically specific to the target sequence search, so future applications require a reimplementation if they wish to run in parallel.
This thesis analyzes the performance of two versions of mpiBLAST, noting trends as well as differences between them. Results suggest that these embarrassingly parallel applications are dominated by the time required to search vast amounts of data, and not by the communication necessary to support such searches. Consequently, a framework named gridRuby is introduced which alleviates two main issues with current parallel sequence search applications; namely, the requirement of a tightly knit computing environment and the specific, hand-crafted nature of parallelization. Results show that gridRuby can parallelize an application across a set of machines through minimal implementation effort, and can still exhibit super-linear speedup. / Master of Science
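
The database partitioning described in this abstract can be illustrated with a minimal sketch. It is not mpiBLAST or gridRuby code: the function names, the toy k-mer score, and the thread pool are all assumptions chosen for illustration, and the sketch only shows how a search decomposes into independent work units whose results are merged.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(database, n_fragments):
    """Split a list of (name, sequence) records into roughly equal fragments."""
    return [database[i::n_fragments] for i in range(n_fragments)]


def score(query, subject, k=3):
    """Toy similarity (an assumption, not BLAST): count of distinct shared k-mers."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(query) & kmers(subject))


def search_fragment(query, fragment):
    """Work unit: score the query against every record in one fragment."""
    return [(score(query, seq), name) for name, seq in fragment]


def parallel_search(query, database, n_workers=4, top=3):
    """Search each fragment independently, then merge and rank the hits."""
    fragments = partition(database, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(lambda frag: search_fragment(query, frag), fragments))
    hits = [hit for part in partial for hit in part]
    # Highest score first; ties broken by name so the order is deterministic.
    return sorted(hits, key=lambda h: (-h[0], h[1]))[:top]


database = [("a", "ACGTACGT"), ("b", "TTTTTTTT"), ("c", "ACGTTTTT"), ("d", "GGGGGGGG")]
best = parallel_search("ACGTAC", database, n_workers=2, top=2)
```

Because no fragment depends on another, the only communication is the final merge, which is consistent with the abstract's observation that search time rather than communication dominates. Python threads do not speed up CPU-bound scoring; a real deployment would place the work units on separate processes or machines.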
|
2 |
Entropy Measurements and Ball Cover Construction for Biological Sequences
Robertson, Jeffrey Alan 01 August 2018
As improving technology makes it easier to select or engineer DNA sequences that produce dangerous proteins, it is important to be able to predict whether a novel DNA sequence is potentially dangerous by determining its taxonomic identity and functional characteristics. These tasks can be facilitated by the ever-increasing amounts of available biological data. Unfortunately, these growing databases can be difficult to take full advantage of because computational and storage costs increase correspondingly. Entropy scaling algorithms and data structures can expedite this type of analysis by scaling with the amount of entropy contained in the database rather than with the size of the database. Because sets of DNA and protein sequences are biologically meaningful rather than random, they exhibit some amount of structure. As biological databases grow, taking advantage of this structure can be extremely beneficial. The entropy scaling sequence similarity search algorithm introduced here demonstrates this by accelerating the biological sequence search tools BLAST and DIAMOND. Tests of the implementation of this algorithm show that while this approach can lead to improved query times, constructing the required entropy scaling indices is difficult and expensive. To improve performance and remove this bottleneck, I investigate several ideas for accelerating the construction of indices that support entropy scaling searches. The results of these tests identify key tradeoffs and demonstrate that there is potential in using these techniques for sequence similarity searches. / Master of Science / As biological organisms are created and discovered, it is important to compare their genetic information to that of known organisms in order to detect possibly harmful or dangerous properties.
However, the collection of published genetic information from known organisms is huge and growing rapidly, making it difficult to search. This thesis shows that it might be possible to use the non-random properties of biological information to increase the speed and efficiency of searches; that is, because genetic sequences are not random but have common structures, an increase in known data does not mean a proportional increase in complexity, known as entropy. Specifically, when comparing a new sequence to a set of previously known sequences, it is important to choose the correct algorithms for comparing the similarity of two sequences, also known as the distance between them. This thesis explores the performance of an entropy scaling algorithm compared to several conventional tools.
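
The ball cover and entropy scaling search mentioned in this abstract can be sketched roughly as follows. This is a generic illustration and not the thesis implementation: Hamming distance on equal-length strings, the greedy cover, and all function names are assumptions. The idea is to group similar sequences into balls around centers, compare a query against the centers first, and examine members only for balls that the triangle inequality cannot rule out.

```python
def hamming(a, b):
    """Hamming distance; assumes equal-length sequences (it is a true metric)."""
    return sum(x != y for x, y in zip(a, b))


def build_ball_cover(seqs, radius, dist=hamming):
    """Greedy cover: each sequence joins the first center within `radius`,
    otherwise it becomes the center of a new ball."""
    clusters = {}  # center sequence -> list of member sequences
    for s in seqs:
        for center in clusters:
            if dist(s, center) <= radius:
                clusters[center].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


def search(query, clusters, radius, query_radius, dist=hamming):
    """Return every indexed sequence within `query_radius` of `query`."""
    hits = []
    for center, members in clusters.items():
        # Coarse step: by the triangle inequality, a ball whose center is
        # farther than query_radius + radius cannot contain a hit.
        if dist(query, center) <= query_radius + radius:
            # Fine step: check the members of the surviving ball.
            hits.extend(m for m in members if dist(query, m) <= query_radius)
    return hits


seqs = ["AAAA", "AAAT", "AATT", "CCCC", "CCCG", "GGGG"]
cover = build_ball_cover(seqs, radius=1)
found = search("AAAT", cover, radius=1, query_radius=1)
```

The number of distance computations depends on how many balls exist and how many survive the coarse step, not directly on the database size, which is why highly structured (low-entropy) data benefits. The greedy construction compares each sequence against the existing centers, which hints at why building the index can be the expensive part.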
|
3 |
Heuristic multi-sequence search methods
Jochumsson, Thorvaldur January 2001
With the increasing size of sequence databases, heuristic search approaches have become necessary. Hidden Markov models are the best-performing search methods known today with respect to discriminative power, but their time complexity is too high to be practical when searching large sequence databases. In this report, heuristic algorithms that reduce the search space before searching with traditional hidden Markov model search algorithms are presented and experimentally validated. The results of the validation show that the heuristic search algorithms speed up the searches without decreasing their discriminative power.
|
5 |
IMPROVING REMOTE HOMOLOGY DETECTION USING A SEQUENCE PROPERTY APPROACH
Cooper, Gina Marie 29 September 2009
No description available.
|
6 |
ULTRA-FAST AND MEMORY-EFFICIENT LOOKUPS FOR CLOUD, NETWORKED SYSTEMS, AND MASSIVE DATA MANAGEMENT
Yu, Ye 01 January 2018
Systems that process big data (e.g., high-traffic networks and large-scale storage) prefer data structures and algorithms with small memory footprints and fast processing speeds. Efficient and fast algorithms play an essential role in system design, despite improvements in hardware. This dissertation is organized around a novel algorithm called Othello Hashing. Othello Hashing supports ultra-fast and memory-efficient key-value lookup, and it fits the requirements of the core algorithms of many large-scale systems and big data applications. Using Othello Hashing, combined with domain expertise in cloud, computer networks, big data, and bioinformatics, I developed the following applications that resolve several major challenges in these areas.
Concise: Forwarding Information Base. A Forwarding Information Base (FIB) is a data structure used by the data plane of a forwarding device to determine the proper forwarding actions for packets. The polymorphic property of Othello Hashing enables the separation of its query and control functionalities, which is a perfect match for programmable networks such as Software Defined Networks. Using Othello Hashing, we built a fast and scalable FIB named Concise. Extensive evaluation results on three different platforms show that Concise outperforms other FIB designs.
SDLB: Cloud Load Balancer. In a cloud network, a layer-4 load balancer is a device that acts as a reverse proxy and distributes network or application traffic across a number of servers. We built a software load balancer named SDLB using Othello Hashing techniques. SDLB accomplishes the two functions of a load balancer with a single Othello query: finding the designated server for packets of ongoing sessions and distributing new or session-free packets.
MetaOthello: Taxonomic Classification of Metagenomic Sequences. Metagenomic read classification is a critical step in the identification and quantification of microbial species sampled by high-throughput sequencing. Due to the growing popularity of metagenomic data in both basic science and clinical applications, as well as the increasing volume of data being generated, efficient and accurate algorithms are in high demand. We built a system that supports efficient taxonomic classification of sequences using their k-mer signatures.
SeqOthello: RNA-seq Sequence Search Engine. Advances in the study of functional genomics have produced a vast supply of RNA-seq datasets. However, how to quickly query and extract information from sequencing resources remains a challenging problem and has been the bottleneck for the broader dissemination of sequencing efforts. The challenge resides in both the sheer volume of the data and its unstructured representation. Using Othello Hashing techniques, we built the SeqOthello sequence search engine. SeqOthello is a reference-free, alignment-free, and parameter-free sequence search system that supports arbitrary sequence queries against large collections of RNA-seq experiments, which enables large-scale integrative studies using sequence-level data.
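
The key-value lookup at the core of the applications above can be sketched in a few lines. This is a simplified illustration of the general Othello Hashing idea and not the dissertation's implementation: the class and helper names, the SHA-256-based hash functions, and the array sizes (twice the number of keys) are assumptions. Each key selects one cell in array `a` and one in array `b`, and the stored value is the XOR of the two cells; construction succeeds when the graph formed by these cell pairs has no cycle.

```python
import hashlib


def _h(key, seed, size):
    """Seeded hash of a string key into range(size) (illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "little") % size


class Othello:
    """Maps each known key to a non-negative integer value.
    Unknown keys return an arbitrary value; membership is not checked."""

    def __init__(self, mapping, max_tries=100):
        n = max(1, len(mapping))
        self.ma = self.mb = 2 * n  # assumed sizing, chosen for simplicity
        for attempt in range(max_tries):
            self.seeds = (2 * attempt, 2 * attempt + 1)
            if self._build(mapping):
                return
        raise RuntimeError("no acyclic hash pair found")

    def _build(self, mapping):
        adj = {}  # node -> list of (neighbor, edge id, value)
        for eid, (key, val) in enumerate(mapping.items()):
            u = _h(key, self.seeds[0], self.ma)
            v = self.ma + _h(key, self.seeds[1], self.mb)
            adj.setdefault(u, []).append((v, eid, val))
            adj.setdefault(v, []).append((u, eid, val))
        cell = {}
        for root in adj:
            if root in cell:
                continue
            cell[root] = 0
            stack = [(root, -1)]
            while stack:
                node, via = stack.pop()
                for nxt, eid, val in adj[node]:
                    if eid == via:
                        continue
                    if nxt in cell:
                        return False  # cycle: retry with new seeds
                    cell[nxt] = cell[node] ^ val
                    stack.append((nxt, eid))
        self.a = [cell.get(i, 0) for i in range(self.ma)]
        self.b = [cell.get(self.ma + j, 0) for j in range(self.mb)]
        return True

    def lookup(self, key):
        """Two hashes, two array reads, one XOR."""
        return self.a[_h(key, self.seeds[0], self.ma)] ^ self.b[_h(key, self.seeds[1], self.mb)]


# Hypothetical k-mer -> taxon ID table, in the spirit of MetaOthello.
kmer_to_taxon = {"ACGT": 1, "CGTA": 2, "GTAC": 3, "TTTT": 1}
index = Othello(kmer_to_taxon)
```

A lookup needs only the two arrays and the two seeds, while construction needs the full key set. In this sketch that is what allows the query structure to be kept small and separate from the component that builds and updates it.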
|