11 |
Ambiguous fragment assignment for high-throughput sequencing experiments. Roberts, Adam. 28 May 2014.
As the cost of short-read, high-throughput DNA sequencing continues to fall rapidly, new uses for the technology have been developed aside from its original purpose of determining the genomes of various species. Many of these new experiments use the sequencer as a digital counter for measuring biological activities such as gene expression (RNA-Seq) or protein binding (ChIP-Seq).

A common problem faced in the analysis of these data is that of sequenced fragments that are "ambiguous", meaning they resemble multiple loci in a reference genome or other sequence. In early analyses, such ambiguous fragments were ignored or were assigned to loci using simple heuristics. However, statistical approaches using maximum likelihood estimation have been shown to greatly improve the accuracy of downstream analyses and have become widely adopted. Optimization based on the expectation-maximization (EM) algorithm is often employed by these methods to find the optimal sets of alignments, with frequent enhancements to the model. Nevertheless, these improvements increase complexity, which, along with the exponential growth in the size of sequencing datasets, has led to new computational challenges.

Herein, we present our model for ambiguous fragment assignment for RNA-Seq, which includes the most comprehensive set of parameters of any model introduced to date, as well as various methods we have explored for scaling our optimization procedure. These methods include the use of an online EM algorithm and a distributed EM solution implemented on the Spark cluster computing system. Our advances have resulted in the first efficient solution to the problem of fragment assignment in sequencing.

Furthermore, we are the first to create a fully generalized model for ambiguous fragment assignment and present details on how our method can provide solutions for additional high-throughput sequencing assays, including ChIP-Seq, Allele-Specific Expression (ASE), and the detection of RNA-DNA Differences (RDDs) in RNA-Seq.
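The basic EM iteration underlying this family of fragment-assignment methods can be sketched as follows. This is a minimal illustration assuming a simple binary fragment-to-transcript compatibility matrix; the dissertation's actual model includes many additional parameters (bias, fragment length distributions, etc.) that are omitted here, and the toy data are hypothetical.

```python
# Minimal sketch of EM-based ambiguous fragment assignment (illustrative only).
import numpy as np

def em_fragment_assignment(compat, n_iters=100):
    """compat[f, t] = 1 if fragment f aligns to transcript t, else 0."""
    n_frags, n_txps = compat.shape
    rho = np.full(n_txps, 1.0 / n_txps)            # transcript abundances
    for _ in range(n_iters):
        # E-step: fractionally assign each fragment among compatible transcripts
        weights = compat * rho                      # shape (n_frags, n_txps)
        weights /= weights.sum(axis=1, keepdims=True)
        # M-step: re-estimate abundances from expected counts
        rho = weights.sum(axis=0)
        rho /= rho.sum()
    return rho

# Toy example: 3 fragments, 2 transcripts; fragment 0 is ambiguous.
compat = np.array([[1, 1],
                   [1, 0],
                   [0, 1]], dtype=float)
print(em_fragment_assignment(compat))
```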
12 |
Protein structure analysis and prediction utilizing the Fuzzy Greedy K-means Decision Forest model and Hierarchically-Clustered Hidden Markov Models method. Hudson, Cody Landon. 13 February 2014.
Structural genomics is a field of study that strives to derive and analyze the structural characteristics of proteins through means of experimentation and prediction using software and other automatic processes. Alongside implications for more effective drug design, the main motivation for structural genomics concerns the elucidation of each protein's function, given that the structure of a protein almost completely governs its function. Historically, the approach to deriving the structure of a protein has been through exceedingly expensive, complex, and time-consuming methods such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy.

In response to the inadequacies of these methods, three families of approaches have been developed in a relatively new branch of computer science known as bioinformatics: threading, homology modeling, and the de novo approach. However, even these methods fail due to impracticalities, the inability to produce novel folds, rampant complexity, inherent limitations, and similar shortcomings. In their stead, this work proposes the Fuzzy Greedy K-means Decision Forest (FGK-DF) model, which utilizes sequence motifs that transcend protein family boundaries to predict local tertiary structure, such that the method is cheap, effective, and can produce semi-novel folds due to its local (rather than global) prediction mechanism. This work further extends the FGK-DF model with a new algorithm, the Hierarchically Clustered-Hidden Markov Models (HC-HMM) method, to extract protein primary sequence motifs more accurately and adequately than the current FGK-DF model, allowing for more accurate and powerful local tertiary structure predictions. Both algorithms are critically examined, their methodology thoroughly explained and tested against a consistent data set, and the results discussed at length.
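For orientation, the fuzzy k-means membership update that clustering models of this kind build on can be sketched as below. This is only the generic fuzzy c-means membership formula, not the FGK-DF algorithm itself (which adds a greedy selection step and a decision-forest layer); the feature encodings in the toy example are hypothetical.

```python
# Sketch of the fuzzy k-means (fuzzy c-means) membership update.
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Return membership matrix u[i, j] of point i in cluster j (fuzzifier m > 1)."""
    # Distances from every point to every cluster center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    # u[i, j] = 1 / sum_k (d[i, j] / d[i, k]) ** (2 / (m - 1))
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# Toy example with hypothetical numeric encodings of sequence windows
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1]])
centers = np.array([[0.1, 0.05], [5.0, 5.0]])
print(fuzzy_memberships(X, centers))   # rows sum to 1: soft cluster assignments
```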
13 |
Analysis and applications of conserved sequence patterns in proteins. Ie, Tze Way Eugene. Unknown Date.
Thesis (Ph.D.)--University of California, San Diego, 2007. / (UMI)AAI3264605. Source: Dissertation Abstracts International, Volume: 68-04, Section: B, page: 2446. Adviser: Yoav Freund.
14 |
Text mining of point mutation information from biomedical literature. Lee, Lawrence Chet-Lun. January 2008.
Thesis (Ph.D.)--University of California, San Francisco, 2008. / Source: Dissertation Abstracts International, Volume: 69-12, Section: B, page: 7230. Adviser: Fred E. Cohen.
15 |
The design and evaluation of an assistive application for dialysis patients. Siek, Katie A. January 2006.
Thesis (Ph.D.)--Indiana University, Dept. of Computer Science, 2006. / "Title from dissertation home page (viewed June 28, 2007)." Source: Dissertation Abstracts International, Volume: 67-06, Section: B, page: 3242. Adviser: Kay H. Connelly.
16 |
Computational Pan-Genomics: Algorithms and Applications. Cleary, Alan Michael. 02 June 2018.
As the cost of sequencing DNA continues to drop, the number of sequenced genomes rapidly grows. In the recent past, the cost dropped so low that it is no longer prohibitively expensive to sequence multiple genomes for the same species. This has led to a shift from the single reference genome per species paradigm to the more comprehensive pan-genomics approach, where populations of genomes from one or more species are analyzed together.

The total genomic content of a population is vast, requiring algorithms for analysis that are more sophisticated and scalable than existing methods. In this dissertation, we explore new algorithms and their applications to pan-genome analysis, both at the nucleotide and genic resolutions. Specifically, we present the Approximate Frequent Subpaths and Frequented Regions problems as a means of mining syntenic blocks from pan-genomic de Bruijn graphs and provide efficient algorithms for mining these structures. We then explore a variety of analyses that mining synteny blocks from pan-genomic data enables, including meaningful visualization, genome classification, and multidimensional scaling. We also present a novel interactive data mining tool for pan-genome analysis, the Genome Context Viewer, which allows users to explore pan-genomic data distributed across a heterogeneous set of data providers by using gene family annotations as a unit of search and comparison. Using this approach, the tool is able to perform traditionally cumbersome analyses on demand in a federated manner.
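To make the data structure concrete, the following is a minimal sketch of a pan-genomic de Bruijn graph in which each k-mer node records which input genomes traverse it; mining Frequented Regions would then look for subpaths supported by many genome paths, a step not shown here. The sequences, genome identifiers, and choice of k are hypothetical.

```python
# Minimal sketch of a pan-genomic de Bruijn graph with per-node genome support.
from collections import defaultdict

def build_pan_dbg(genomes, k=4):
    edges = defaultdict(set)      # k-mer -> set of successor k-mers
    support = defaultdict(set)    # k-mer -> set of genome ids traversing it
    for gid, seq in genomes.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for a, b in zip(kmers, kmers[1:]):
            edges[a].add(b)
        for km in kmers:
            support[km].add(gid)
    return edges, support

genomes = {"g1": "ACGTACGTGG", "g2": "ACGTACGAGG"}
edges, support = build_pan_dbg(genomes, k=4)
# k-mers traversed by every genome are candidate anchors for conserved regions
shared = [km for km, gids in support.items() if len(gids) == len(genomes)]
print(shared)
```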
17 |
Towards large-scale validation of protein flexibility using rigidity analysis. Jagodzinski, Filip. 01 January 2012.
Proteins are dynamic molecules involved in virtually every chemical process in our bodies. Understanding how they flex and bend provides fundamental insights into their functions. At the atomic level, protein motion cannot be observed using existing experimental methods. To gain insights into these motions, simulation methods are used; however, such simulations are computationally expensive. Rigidity analysis is a fast, graph-based alternative to molecular simulation that gives information about the flexibility properties of molecules modeled as mechanical structures. Due to the lack of convenient tools for curating protein data, the usefulness of rigidity analysis has been demonstrated on only a handful of proteins to infer several of their biophysical properties. Previous studies also relied on heuristics to determine which modeling options for important stabilizing interactions allowed relevant biological observations to be extracted from rigidity analysis results. Thus there is no agreed-upon model of stabilizing interactions that has been validated against experimental data.

In this thesis we make progress towards large-scale validation of protein flexibility using rigidity analysis. We have developed the KINARI software to test the predictive power of using rigidity analysis to infer biophysical properties of proteins. We develop new tools for curating protein data files and for generating biological functional forms and crystal lattices of molecules. We show that rigidity analysis of these biological assemblies provides structural and functional information that would be missed if only the unprocessed protein structure data were analyzed. To provide a proof of concept that rigidity analysis can be used to perform fast evaluation of in silico mutations that may not be easy to perform in vitro, we have developed KINARI-Mutagen. Finally, we perform a systematic study in which we vary how hydrogen bonds and hydrophobic interactions are modeled when constructing a mechanical framework of a protein. We propose a general method to evaluate how varying the modeling of these important inter-atomic interactions affects the degree to which rigidity parameters correlate with experimental stability data.
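The modeling step being varied can be illustrated as follows: atoms become vertices of a constraint graph, while covalent bonds, hydrogen bonds, and hydrophobic contacts become edges, with the latter two included only above user-chosen cutoffs. This is an illustrative sketch, not KINARI's actual implementation; the cutoff values and atom labels are hypothetical, and the pebble-game rigidity computation that would run on the resulting graph is omitted.

```python
# Sketch: assembling a protein's mechanical framework as a constraint graph.
import networkx as nx

def build_framework(atoms, covalent, hbonds, hydrophobics,
                    hbond_energy_cutoff=-1.0, hydrophobic_dist_cutoff=3.5):
    g = nx.Graph()
    g.add_nodes_from(atoms)
    g.add_edges_from(covalent, kind="covalent")
    for a, b, energy in hbonds:
        if energy <= hbond_energy_cutoff:           # keep only strong H-bonds
            g.add_edge(a, b, kind="hbond")
    for a, b, dist in hydrophobics:
        if dist <= hydrophobic_dist_cutoff:         # keep only close contacts
            g.add_edge(a, b, kind="hydrophobic")
    return g

# Toy usage with hypothetical atom labels
g = build_framework(
    atoms=["N1", "CA1", "C1", "O1"],
    covalent=[("N1", "CA1"), ("CA1", "C1"), ("C1", "O1")],
    hbonds=[("O1", "N1", -2.3)],
    hydrophobics=[("O1", "CA1", 3.2)],
)
print(g.number_of_edges())
```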
18 |
Adaptive balancing of exploitation with exploration to improve protein structure prediction. Brunette, TJ. 01 January 2011.
The most significant impediment to protein structure prediction is the inadequacy of conformation space search. Conformation space is too large and the energy landscape too rugged for existing search methods to consistently find near-optimal minima. Conformation space search methods thus have to focus exploration on a small fraction of the search space. The ability to choose appropriate regions, i.e., regions that are highly likely to contain the native state, critically impacts the effectiveness of search. Making the choice of where to explore requires information, with higher-quality information resulting in better choices. Most current search methods are designed to work in as many domains as possible, which leads to less accurate information because of the need for generality. However, most domains provide unique and accurate information. To best utilize domain-specific information, search needs to be customized for each domain. The first contribution of this thesis customizes search for protein structure prediction, resulting in significantly more accurate protein structure predictions.

Unless information is perfect, mistakes will be made, and search will focus on regions that do not contain the native state. How search recovers from mistakes is critical to its effectiveness. To recover from mistakes, this thesis introduces the concept of adaptive balancing of exploitation with exploration, which allows search to use information only to the extent to which it guides exploration toward the native state.

Existing methods of protein structure prediction rely on information from known proteins. Currently, this information comes either from full-length proteins that share similar sequences, and hence have similar structures (homologs), or from short protein fragments. Homologs and fragments represent two extremes on the spectrum of information from known proteins; significant additional information can be found between these extremes. However, current protein structure prediction methods are unable to use information between fragments and homologs because it is difficult to identify the correct information among the enormous amount of incorrect information. This thesis makes it possible to use information between homologs and fragments by adaptively balancing exploitation with exploration in response to an estimate of template protein quality. My results indicate that integrating the information between homologs and fragments significantly improves protein structure prediction accuracy, resulting in several proteins predicted with <1 angstrom RMSD resolution.
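The general idea of tying the exploitation rate to an estimate of template quality can be sketched in a toy form as below. This is only an illustration of adaptive balancing in the abstract, not the thesis's actual search algorithm; the energy function, move operators, and tuning constants are hypothetical.

```python
# Toy sketch: the probability of an "exploit" move (follow template information)
# starts at an estimated template quality and adapts as exploitation succeeds or fails.
import random

def adaptive_search(energy, exploit_move, explore_move, state,
                    template_quality=0.7, steps=1000, decay=0.95, boost=1.02):
    p_exploit = template_quality
    best = energy(state)
    for _ in range(steps):
        move = exploit_move if random.random() < p_exploit else explore_move
        candidate = move(state)
        cand_energy = energy(candidate)
        if cand_energy < best:
            best, state = cand_energy, candidate
            if move is exploit_move:                 # exploitation paid off
                p_exploit = min(0.95, p_exploit * boost)
        elif move is exploit_move:                   # exploitation misled the search
            p_exploit = max(0.05, p_exploit * decay)
    return state

# 1-D toy usage: the template suggests ~2.5, while the true minimum is at 3.0
energy = lambda x: (x - 3.0) ** 2
exploit = lambda x: x + 0.1 * (2.5 - x)
explore = lambda x: x + random.uniform(-1, 1)
print(adaptive_search(energy, exploit, explore, state=0.0))
```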
19 |
An Ensemble Prognostic Model for Metastatic, Castrate-Resistant Prostate Cancer. Vang, Yeeleng Scott. 20 October 2016.
Metastatic, castrate-resistant prostate cancer (mCRPC) is one of the most prevalent cancers and is the third leading cause of cancer death among men. Several treatment options have been developed to combat mCRPC; however, none have produced any tangible benefits to patients' overall survivability. As part of a crowd-sourced algorithm development competition, participants were asked to develop new prognostic models for mCRPC patients treated with docetaxel. Such results could potentially assist in clinical decision making for future mCRPC patients.

In this thesis, we present a new ensemble prognostic model to perform risk prediction for mCRPC patients treated with docetaxel. We rely on a traditional survival analysis model, the Cox Proportional Hazards model, as well as a more recently developed boosting model that incorporates a smooth approximation of the concordance index for direct optimization. Our model performs better than the current state-of-the-art mCRPC prognostic models on the concordance index performance measure and is competitive with these models on the integrated time-dependent area under the receiver operating characteristic curve.
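For readers unfamiliar with the evaluation metric, the concordance index and a smooth sigmoid surrogate of the kind used for direct optimization in such boosting models can be sketched as follows. This is a simplified illustration (no tie handling, no censoring subtleties beyond event indicators) and not the thesis's exact formulation.

```python
# Sketch of the concordance index (c-index) and a differentiable sigmoid surrogate.
import numpy as np

def c_index(times, events, risk):
    """Fraction of comparable pairs ordered correctly by predicted risk."""
    num, den = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:      # subject i failed before j
                den += 1
                num += 1.0 if risk[i] > risk[j] else (0.5 if risk[i] == risk[j] else 0.0)
    return num / den

def smooth_c_index(times, events, risk, sigma=1.0):
    """Surrogate: replace the pairwise indicator with a sigmoid for optimization."""
    num, den = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                den += 1
                num += 1.0 / (1.0 + np.exp(-(risk[i] - risk[j]) / sigma))
    return num / den

# Toy data: higher predicted risk should correspond to earlier events
times, events, risk = [2.0, 5.0, 8.0], [1, 1, 0], [0.9, 0.4, 0.1]
print(c_index(times, events, risk), smooth_c_index(times, events, risk))
```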
20 |
Algorithmic Enhancements to Data Colocation Grid Frameworks for Big Data Medical Image Processing. Bao, Shunxing. 19 April 2019.
Large-scale medical imaging studies to date have predominantly leveraged in-house, laboratory-based, or traditional grid computing resources for their computing needs, where the applications often use hierarchical data structures (e.g., Network File System file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The resulting performance for laboratory-based approaches reveals that performance is impeded by standard network switches, since typical processing can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. On the other hand, the grid may be costly to use due to the dedicated resources required to execute the tasks and the lack of elasticity. With the increasing availability of cloud-based big data frameworks such as Apache Hadoop, cloud-based services for executing medical imaging studies have shown promise.

Despite this promise, our studies have revealed that existing big data frameworks exhibit various performance limitations for medical imaging applications, which calls for new algorithms that optimize their performance and suitability for medical imaging. For instance, Apache HBase's data distribution strategy of region split and merge is detrimental to the hierarchical organization of imaging data (e.g., project, subject, session, scan, slice). Big data medical image processing applications involving multi-stage analysis often exhibit significant variability in processing times, ranging from a few seconds to several days. Due to the sequential nature of executing the analysis stages with traditional software technologies and platforms, any errors in the pipeline are only detected at the later stages, despite the sources of errors predominantly being the highly compute-intensive first stage. This wastes precious computing resources and incurs prohibitively higher costs for re-executing the application. To address these challenges, this research proposes a framework, Hadoop & HBase for Medical Image Processing (HadoopBase-MIP), which develops a range of performance optimization algorithms and employs system behavior modeling for data storage, data access, and data processing. We also describe prototypes built to support empirical verification of these system behaviors. Furthermore, we report a discovery made during the development of HadoopBase-MIP: a new type of contrast for enhancing deep brain structures in medical imaging. Finally, we show how to carry the Hadoop-based framework design forward to a commercial big data / high-performance computing cluster with a cheap, scalable, and geographically distributed file system.
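One common way to keep hierarchically organized imaging data colocated in an HBase-style store is to encode the hierarchy into the row key so that lexicographic order matches the project/subject/session/scan/slice hierarchy and region splits do not scatter a study. The sketch below illustrates that general idea only; the key layout shown is hypothetical and not necessarily the scheme used by HadoopBase-MIP.

```python
# Sketch of a hierarchy-preserving row-key scheme for an HBase-style store.
def image_row_key(project, subject, session, scan, slice_idx):
    # Zero-padding keeps lexicographic order identical to numeric slice order.
    return f"{project}:{subject}:{session}:{scan}:{slice_idx:05d}"

keys = [image_row_key("proj01", "subj003", "sess1", "T1w", i) for i in range(3)]
print(keys)  # contiguous keys, so one study stays colocated within a region
```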