721 |
Structure-Based Virtual Screening in Spark. Capuccini, Marco, January 2015 (has links)
No description available.
|
722 |
Kubernetes as an approach for solving bioinformatic problems. Markstedt, Olof, January 2017 (has links)
The cluster orchestration tool Kubernetes enables easy deployment and reproducibility of life science research by utilizing the advantages of container technology. Containers make it easy to create and share tools, and a container runs on any Linux system once it has been built. The applicability of Kubernetes as an approach to running bioinformatic workflows was evaluated, resulting in examples of how Kubernetes and containers could be used within the field of life science, and how they should not be used. The resulting examples serve as proofs of concept and convey the general idea of how implementation is done. Kubernetes allows for easy resource management and includes automatic scheduling of workloads. It scales rapidly and has some interesting components that are beneficial when conducting life science research.
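As a rough illustration of the kind of usage evaluated here (not an example taken from the thesis), the sketch below submits a containerized analysis step to a cluster as a Kubernetes Job through the official Python client; the image name, command, job name and namespace are invented placeholders, and a working kubeconfig is assumed.

```python
# A minimal sketch of submitting a containerized bioinformatics task as a
# Kubernetes Job via the official Python client. The image tag, command and
# namespace are hypothetical placeholders, not values from the thesis.
from kubernetes import client, config

def submit_blast_job():
    config.load_kube_config()  # assumes a configured kubeconfig is available
    container = client.V1Container(
        name="blast",
        image="biocontainers/blast:2.9.0",  # hypothetical container image
        command=["blastp", "-query", "/data/query.fa", "-db", "/data/nr"],
    )
    spec = client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=2,  # retry a failed pod at most twice
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="blast-job"),
        spec=spec,
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

if __name__ == "__main__":
    submit_blast_job()
```

A workflow would submit many such Jobs and let the Kubernetes scheduler place them on available nodes, which is the resource management and automatic scheduling benefit noted above.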
|
723 |
A study of analogies between processes in technical and biological systems. Shaw, Ian Stephan, 27 May 2013 (has links)
D.Phil. (Electrical and Electronic Engineering) / The knowledge and understanding that a scientist has about the world is often embodied in the form of a model: a representation containing the essential structure of some object or event. The goal of the scientific method is to reduce the complexity of our observations of our surroundings (and ourselves) by creating, verifying and modifying simplified models. In turn, a technical scientist (commonly referred to as an “engineer”) uses appropriately simplified mathematical models to predict and control various processes. Yet the central question is how close such models are to reality in spite of considerable simplifying assumptions, and whether they are reliable and credible enough to be accepted as valid. In the following, models applied in technical science (commonly referred to as “engineering”) are examined to find out whether such mathematical models are valid in biology as well. It is shown that these models fall short of a valid representation of biological phenomena. In turn, the concept of analogy, a method borrowed from cognitive science, is introduced as another way of representing knowledge and constructing models.
|
724 |
Cell States and Cell Fate: Statistical and Computational Models in (Epi)Genomics. Fernandez, Daniel, 18 March 2015 (has links)
This dissertation develops and applies several statistical and computational methods to the analysis of Next Generation Sequencing (NGS) data in order to gain a better understanding of our biology. The rest of the first chapter introduces key concepts in molecular biology and recent technological developments that help us better understand this complex science, which, in turn, provide the foundation and motivation for the subsequent chapters.
In the second chapter we present the problem of estimating gene/isoform expression at the allelic level, and different models to solve this problem. First, we describe the observed data and the computational workflow to process them. Next, we propose frequentist and Bayesian models motivated by the central dogma of molecular biology and the data generating process (DGP) for RNA-Seq. We develop EM and Gibbs sampling approaches to estimate gene- and transcript-specific expression from our proposed models. Finally, we present the performance of our models in simulations and end with the analysis of experimental RNA-Seq data at the allelic level.
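As a much-simplified illustration of the EM idea (not the models developed in the chapter), the sketch below estimates the paternal allele fraction at a single heterozygous site when some reads cannot be assigned to an allele; all read counts are invented.

```python
# A toy EM sketch for estimating the paternal allele fraction theta from
# RNA-Seq reads at one heterozygous site. n_pat and n_mat map unambiguously
# to one allele; n_amb are consistent with both. All counts are invented.
import numpy as np

def em_allelic_fraction(n_pat, n_mat, n_amb, n_iter=100):
    theta = 0.5  # initial guess: balanced allelic expression
    for _ in range(n_iter):
        # E-step: expected number of ambiguous reads of paternal origin
        e_pat = n_amb * theta
        # M-step: re-estimate theta from the completed counts
        theta = (n_pat + e_pat) / (n_pat + n_mat + n_amb)
    return theta

print(em_allelic_fraction(n_pat=120, n_mat=80, n_amb=50))  # converges to ~0.6
```

In this toy model the ambiguous reads carry no extra information, so the estimate converges to the unambiguous read ratio; the chapter's models are richer, covering isoform structure and the RNA-Seq data generating process.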
In the third chapter we present our paired factorial experimental design to study parentally biased gene/isoform expression in the mouse cerebellum, and dynamic changes of this pattern between young and adult stages of cerebellar development. We present a Bayesian variable selection model to estimate the difference in expression between the paternal and maternal genes, while incorporating relevant factors and their interactions into the model. Next, we apply our model to our experimental data, and then validate our predictions using pyrosequencing follow-up experiments. We subsequently applied our model to the pyrosequencing data across multiple brain regions. Our method, combined with the validation experiments, allowed us to find novel imprinted genes and to investigate, for the first time, imprinting dynamics across brain regions and across development.
In the fourth chapter we move from controlled experiments in isogenic mouse lines to the highly variable world of human genetics in observational studies. In this chapter we introduce a Bayesian Regression Allelic Imbalance Model, BRAIM, that estimates the imbalance coming from two major sources: cis-regulation and imprinting. We model the cis-effect as an additive effect for the heterozygous group, and we model the parent-of-origin effect with a latent variable that indicates to which parent a given allele belongs. Next, we show the performance of the model under simulation scenarios, and finally we apply the model to several experiments across multiple tissues and multiple individuals.
In the fifth chapter we characterize the transcriptional regulation and gene expression of in-vitro Embryonic Stem Cells (ESCs) and two related in-vivo cell populations: the Inner Cell Mass (ICM) and the embryonic tissue at day 6.5. Our objective is twofold. First, we would like to understand the differences in gene expression between the ESCs and the in-vivo counterpart from which they were derived (the ICM). Second, we want to characterize the active transcriptional regulatory regions using several histone modifications and to connect such regulatory activity with gene expression. In this chapter we use several statistical and computational methods to analyze and visualize the data, providing a good showcase of how combining several methods of analysis can shed light on interesting developmental biology.
|
725 |
Quantitative Methods for Analyzing Structure in Genomes, Self-Assembly, and Random Matrices. Huntley, Miriam, 25 July 2017 (has links)
This dissertation presents my graduate work analyzing biological structure. My research spans three different areas, which I discuss in turn. First I present my work studying how the genome folds. The three-dimensional structure of the genome inside the nucleus is a matter of great biological importance, yet there are many questions about just how the genetic material is folded up. To probe this, we performed Hi-C experiments to create the highest resolution dataset (to date) of genome-wide contacts in the nucleus. Analysis of this data uncovered an array of fundamental structures in the folded genome. We discovered approximately 10,000 loops in the human genome, each of which brings a pair of loci that lie far apart along the DNA strand (up to millions of base pairs away) into close proximity. We found that contiguous stretches of DNA are segregated into self-associating contact domains. These domains are associated with distinct patterns of histone marks and segregate into six nuclear subcompartments. We found that these spatial structures are deeply connected to the regulation of the genome and cell function, suggesting that understanding and characterizing the 3D structure of the genome is crucial for a complete description of biology. Second, I present my work on self-assembly. Many biological structures are formed via 'bottom-up' assembly, wherein a collection of subunits assemble into a complex arrangement. In this work we developed a theory which predicts the fundamental complexity limits for these types of systems. Using an information theory framework, we calculated the capacity, the maximum amount of information that can be encoded and decoded in systems of specific interactions, giving possible future directions for improvements in experimental realizations of self-assembly. Lastly, I present work examining the statistical structure of noisy data. Experimental datasets are a combination of signal and randomness, and data analysis algorithms, such as Principal Component Analysis (PCA), all seek to extract the signal. We used random matrix theory to demonstrate that even in situations where the dataset contains too much noise for PCA to be successful, the signal can still be recovered with the use of prior information. / Engineering and Applied Sciences - Applied Math
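As a generic illustration of the random matrix reasoning mentioned in the last part of the abstract (not the dissertation's own derivation), the sketch below compares the leading eigenvalue of a sample covariance matrix against the Marchenko-Pastur bulk edge expected for pure noise; the data are simulated.

```python
# A generic sketch: test whether the leading principal component of noisy
# data rises above the Marchenko-Pastur bulk edge (1 + sqrt(p/n))^2 that
# unit-variance noise alone would produce. All data are simulated.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 200                                  # samples x features
signal = rng.standard_normal((n, 1)) @ rng.standard_normal((1, p)) * 0.3
data = signal + rng.standard_normal((n, p))      # rank-one signal plus noise

cov = data.T @ data / n                          # sample covariance (p x p)
eigvals = np.linalg.eigvalsh(cov)                # ascending order
mp_edge = (1 + np.sqrt(p / n)) ** 2              # asymptotic noise-only maximum

print(f"top eigenvalue {eigvals[-1]:.2f} vs MP edge {mp_edge:.2f}")
# If the top eigenvalue clears the edge, PCA can recover the signal direction;
# below the edge, prior information is needed, as the dissertation argues.
```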
|
726 |
Complexity Reduction for Near Real-Time High Dimensional Filtering and Estimation Applied to Biological Signals. Gupta, Manish, 25 July 2017 (has links)
Real-time processing of physiological signals collected from wearable sensors, carried out with low computational power, is a requirement for continuous health monitoring. Such processing involves identifying an underlying physiological state x from a measured biomedical signal y, where the two are related stochastically: y = f(x; e) (here e is a random variable). Often the state space of x is large and the dimensionality of y is low: if y has dimension N and S is the state space of x, then |S| >> N, since the purpose is to infer a complex physiological state from minimal measurements. This makes real-time inference a challenging task. We present algorithms that address this problem by using lower dimensional approximations of the state. Our algorithms are based on two techniques often used for state dimensionality reduction: (a) decomposition, where variables can be grouped into smaller sets, and (b) factorization, where variables can be factored into smaller sets. The algorithms are computationally inexpensive and permit online application. We demonstrate their use in dimensionality reduction by successfully solving two complex real-world problems in medicine and public safety.
Motivated originally by the problem of predicting cognitive fatigue state from EEG (Chapter 1), we developed the Correlated Sparse Signal Recovery (CSSR) algorithm and successfully applied it to the problem of eliminating blink artifacts from the EEG of awake subjects (Chapter 2). Finding the decomposition x = x1 + x2, where x1 is a low-dimensional representation of the artifact signal, is a non-trivial problem, and currently there are no online real-time methods that accurately solve it for small N (the dimensionality of y). By using a skew-Gaussian dictionary and a novel method to represent group statistical structure, CSSR is able to identify and remove blink artifacts even from few (e.g. 4-6) channels of EEG recording in near real-time. The method uses a Bayesian framework. It results in more effective decomposition, as measured by spectral and entropy properties of the decomposed signals, than some state-of-the-art artifact subtraction and structured sparse recovery methods. CSSR is novel in structured sparsity: unlike existing group sparse methods (such as block sparse recovery) it does not rely on the assumption of a common sparsity profile. It is also a novel EEG denoising method: unlike state-of-the-art artifact removal techniques such as independent component analysis, it does not require manual intervention, long recordings or high-density (e.g. 32 or more channels) recordings. This method of denoising is potentially of tremendous utility to the medical community, since EEG artifact removal is usually done manually, a lengthy and tedious process that requires trained technicians and often renders entire epochs of data unusable. Identification of the artifact can itself be used to determine a relevant physiological state from the artifact properties (for example, blink duration and frequency can be used as markers of fatigue). A potential application of CSSR is to determine whether a structurally decomposed (i.e. non-spectral) cortical EEG representation can instead be used for fatigue prediction.
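The sketch below is a generic illustration of dictionary-based sparse artifact subtraction on a single simulated channel; it uses a plain Gaussian blink dictionary and the Lasso rather than CSSR's skew-Gaussian dictionary, group statistical structure and Bayesian framework, and all signal parameters are invented.

```python
# Generic dictionary-based artifact subtraction (not CSSR): blink-like
# templates form a dictionary D, a sparse code is fit with the Lasso, and
# the artifact estimate D @ code is subtracted. Data are simulated.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
t = np.arange(512) / 256.0                        # 2 s of one EEG channel at 256 Hz

def blink(center, width=0.08):
    return np.exp(-0.5 * ((t - center) / width) ** 2)   # Gaussian blink template

centers = np.arange(0.1, 1.9, 0.05)               # candidate blink positions
D = np.stack([blink(c) for c in centers], axis=1) # dictionary, one template per column

clean = 0.5 * np.sin(2 * np.pi * 10 * t)          # 10 Hz oscillation as "brain" signal
y = clean + 4.0 * blink(0.8) + 0.1 * rng.standard_normal(t.size)

code = Lasso(alpha=0.05, positive=True, fit_intercept=False).fit(D, y).coef_
artifact = D @ code                               # low-dimensional artifact estimate
denoised = y - artifact
print("residual blink energy:", np.round(np.sum((denoised - clean) ** 2), 2))
```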
A new E-M based active learning algorithm for ensemble classification is presented in Chapter 3 and applied to the problem of detecting artifactual epochs based upon several criteria, including the sparse features obtained from CSSR. The algorithm offers higher accuracy than existing ensemble methods for unsupervised learning, such as similarity- and graph-based ensemble clustering, as well as higher accuracy and lower computational complexity than several active learning methods, such as Query-by-Committee and Importance-Weighted Active Learning, when tested on data comprising noisy Gaussian mixtures. In one case we were able to identify artifacts with approximately 98% accuracy based upon 31-dimensional data from 700,000 epochs in a matter of seconds on a personal laptop, using less than 10% active labels; this compares to a maximum of 94% from other methods. As far as we know, active learning for ensemble-based classification has not previously been applied to biomedical signal classification, including artifact detection; it can also be applied to other medical areas, including the classification of polysomnographic signals into sleep stages.
Algorithms based upon state-space factorization in the case where there is unidirectional dependence amongst the dynamics of groups of variables (the "Cascade Markov Model") are presented in Chapter 4. An algorithm for estimating the factored state from observations, where the dynamics follow a Markov model, is developed using E-M (i.e. a version of the Baum-Welch algorithm on factored state spaces) and applied to real-time human gait and fall detection. The application of factored HMMs to gait and fall detection is novel; falls in the elderly are a major safety issue. Results from the algorithm show higher fall detection accuracy (95%) than that achieved with PCA-based estimation (70%). In this chapter, a new algorithm for optimal control on factored Markov decision processes is also derived. The algorithm, in the form of decoupled matrix differential equations, is (i) computationally efficient, requiring the solution of a one-point instead of a two-point boundary value problem, and (ii) free of the "curse of dimensionality" inherent in HJB equations, thereby facilitating real-time solution. The algorithm may have applications in medicine, such as finding optimal schedules of light exposure for correction of circadian misalignment and optimal schedules for drug intervention in patients.
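As a generic illustration of Markov-model state estimation of the kind underlying gait and fall detection (not the factored Baum-Welch or optimal control algorithms developed in the chapter), the sketch below runs the standard HMM forward filter on a toy two-state model; all probabilities and observations are invented.

```python
# A toy HMM forward filter: posterior over hidden states ("walking", "fallen")
# given a sequence of discretized accelerometer features. All numbers invented.
import numpy as np

states = ["walking", "fallen"]
A = np.array([[0.98, 0.02],            # state transition probabilities
              [0.10, 0.90]])
B = np.array([[0.70, 0.25, 0.05],      # emission probs: low / medium / high impact
              [0.05, 0.15, 0.80]])
pi = np.array([0.99, 0.01])            # initial state distribution

def forward_filter(obs):
    alpha = pi * B[:, obs[0]]
    alpha /= alpha.sum()               # normalize to a filtered posterior
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # predict, then update with the new observation
        alpha /= alpha.sum()
    return alpha

obs = [0, 0, 1, 2, 2]                  # impact feature rises at the end
posterior = forward_filter(obs)
print(dict(zip(states, np.round(posterior, 3))))  # high probability of "fallen"
```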
The thesis demonstrates the development of new methods for complexity reduction in high-dimensional systems and shows that their application solves some problems in medicine and public safety more efficiently than state-of-the-art methods. / Engineering and Applied Sciences - Applied Math
|
727 |
Coding to cure : NMR and thermodynamic software applied to congenital heart disease research. Niklasson, Markus, January 2017 (has links)
Regardless of scientific field, computers have become pivotal tools for data analysis, and structural biology is no exception. Here, computers are the main tools used for tasks including structural calculations of proteins, spectral analysis of nuclear magnetic resonance (NMR) spectroscopy data, and fitting mathematical models to data. As results reported in papers rely heavily on software and scripts, it is of key importance that the employed computational methods are robust and yield reliable results. However, as many scientific fields are niched and have a small potential user base, the task of developing the necessary software often falls on the researchers themselves. This can cause divergence when comparing data analyzed by different measures or by using subpar methods. Therein lies the importance of developing accurate computational methods that can be employed by the scientific community. The main theme of this thesis is software development applied to structural biology, with the purpose of aiding research in this field by speeding up the process of data analysis and ensuring that acquired data is properly analyzed. Among the original results of this thesis are three user-friendly software packages: COMPASS, a resonance assignment software for NMR spectroscopy data capable of analyzing chemical shifts and providing the user with suggestions for potential resonance assignments, based on a meticulous database comparison; CDpal, a curve fitting software used to fit thermal and chemical denaturation data of proteins acquired by circular dichroism (CD) spectroscopy or fluorescence spectroscopy; and PINT, a line shape fitting and downstream analysis software for NMR spectroscopy data, designed with the important purpose of easily and accurately fitting peaks in NMR spectra and extracting parameters such as relaxation rates, intensities and volumes of peaks. This thesis also describes a study performed on variants of the life-essential regulatory protein calmodulin that have been associated with the congenital, life-threatening heart disease long QT syndrome (LQTS). The study provided novel insights revealing that all variants are distinct from the wild type with regard to structure and dynamics on a detailed level; the presented results are useful for the interpretation of results from protein interaction studies. The underlying research makes use of all three developed software packages, which validates that all developed methods fulfil a scientific purpose and are capable of producing solid results.
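As a rough illustration of the kind of fit CDpal performs (a generic two-state thermal unfolding model fitted with scipy, not CDpal's actual implementation), the sketch below recovers a melting temperature and van 't Hoff enthalpy from simulated CD melting data; every numerical value is invented.

```python
# A generic two-state thermal denaturation fit (not CDpal itself): the observed
# CD signal is a folded/unfolded baseline mix weighted by the Boltzmann factor
# with dG(T) = dH * (1 - T/Tm). All simulated values are invented.
import numpy as np
from scipy.optimize import curve_fit

R = 8.314e-3  # gas constant in kJ/(mol*K)

def two_state(T, dH, Tm, s_folded, s_unfolded):
    dG = dH * (1.0 - T / Tm)                       # van 't Hoff free energy, kJ/mol
    f_unfolded = 1.0 / (1.0 + np.exp(dG / (R * T)))
    return s_folded + (s_unfolded - s_folded) * f_unfolded

T = np.linspace(280, 360, 81)                      # temperature in K
rng = np.random.default_rng(2)
data = two_state(T, dH=300.0, Tm=330.0, s_folded=-20.0, s_unfolded=-2.0)
data += rng.normal(scale=0.3, size=T.size)         # simulated ellipticity noise

popt, _ = curve_fit(two_state, T, data, p0=[250.0, 325.0, -15.0, 0.0])
print("fitted dH, Tm:", np.round(popt[:2], 1))     # should recover ~300 kJ/mol, ~330 K
```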
|
728 |
On the Generation of a Classification Algorithm from DNA Based Microarray Studies. Davies, Robert William, January 2010 (has links)
The purpose of this thesis is to build a classification algorithm using a Genome Wide Association (GWA) study. Briefly, a GWA study is a case-control study using genotypes derived from DNA microarrays for thousands of people. These microarrays are able to acquire the genotypes of hundreds of thousands of Single Nucleotide Polymorphisms (SNPs) for one person at a time. In this thesis, we first describe the processes necessary to prepare the data for analysis. Next, we introduce the Naive Bayes classification algorithm and a modification so that the effects of a SNP on the disease of interest are weighted by a Bayesian posterior probability of association. This thesis then uses the data from three coronary artery disease GWAs, one as a training set and two as test sets, to build and test the classifier. Finally, this thesis discusses the relevance of the results and the generalizability of this method to future studies.
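A minimal sketch of the weighted Naive Bayes idea described above (not the thesis implementation): each SNP contributes its genotype log-likelihood ratio to the case/control score, scaled by a per-SNP posterior probability of association. The genotype frequencies and weights below are invented.

```python
# A toy weighted Naive Bayes score for case vs. control status from SNP
# genotypes (0/1/2 minor-allele counts). Each SNP's log-likelihood-ratio
# contribution is scaled by its posterior probability of association.
# All numbers are invented.
import numpy as np

# P(genotype | case) and P(genotype | control) for 3 hypothetical SNPs
p_case = np.array([[0.50, 0.40, 0.10],
                   [0.30, 0.50, 0.20],
                   [0.60, 0.30, 0.10]])
p_ctrl = np.array([[0.60, 0.35, 0.05],
                   [0.40, 0.45, 0.15],
                   [0.55, 0.35, 0.10]])
post_assoc = np.array([0.9, 0.4, 0.05])   # posterior probability of association
prior_case = 0.5                           # balanced case-control prior

def weighted_nb_score(genotypes):
    idx = np.arange(len(genotypes))
    llr = np.log(p_case[idx, genotypes]) - np.log(p_ctrl[idx, genotypes])
    return np.log(prior_case / (1 - prior_case)) + np.sum(post_assoc * llr)

print(weighted_nb_score(np.array([2, 1, 0])))  # a positive score favours "case"
```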
|
729 |
Predicting drug target proteins and their properties. Bull, Simon, January 2015 (has links)
The discovery of drug targets is a vital component in the development of therapeutic treatments, as it is only through the modulation of a target’s activity that a drug can alleviate symptoms or cure. Accurate identification of drug targets is therefore an important part of any development programme, and has an outsized impact on the programme’s success due to its position as the first step in the pipeline. This makes the stringent selection of potential targets all the more vital when attempting to control the increasing cost and time needed to successfully complete a development programme, and in order to increase the throughput of the entire drug discovery pipeline. In this work, a computational approach was taken to the investigation of protein drug targets. First, a new heuristic, Leaf, for the approximation of a maximum independent set was developed and evaluated in terms of its ability to remove redundancy from protein datasets, the goal being to generate the largest possible non-redundant dataset. The ability of Leaf to remove redundancy was compared to that of pre-existing heuristics and an optimal algorithm, Cliquer. Not only did Leaf find unbiased non-redundant sets that were around 10% larger than those produced by the commonly used PISCES algorithm, the sets it found were no more than one protein smaller than the maximum possible, as determined by Cliquer. Following this, the human proteome was mined to discover properties of proteins that may be important in determining their suitability for pharmaceutical modulation. Data was gathered concerning each protein’s sequence, post-translational modifications, secondary structure, germline variants, expression profile and target status. The data was then analysed to determine features for which the target and non-target proteins had significantly different values. This analysis was repeated for subsets of the proteome consisting of all GPCRs, ion channels, kinases and proteases, as well as for a subset consisting of all proteins that are implicated in cancer. Next, machine learning was used to quantify the proteins in each dataset in terms of their potential to serve as a drug target. For each dataset, this was accomplished by first inducing a random forest that could distinguish between its targets and non-targets, and then using the random forest to quantify the drug target likeness of the non-targets. The properties that best differentiate targets from non-targets were primarily found to be those directly related to a protein’s sequence (e.g. secondary structure). Germline variants, expression levels and interactions between proteins had minimal discriminative power. Overall, the best indicators of drug target likeness were found to be the proteins’ hydrophobicities, in vivo half-lives, propensity for being membrane bound and the fraction of non-polar amino acids in their sequences. In terms of predicting potential targets, the datasets of proteases, ion channels and cancer proteins were able to induce random forests that were highly capable of distinguishing between targets and non-targets. The non-target proteins predicted to be targets by these random forests comprise the set of the most suitable potential future drug targets, and are therefore likely to produce the best results if used as the basis for building a drug development programme.
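The sketch below illustrates redundancy removal with a plain minimum-degree greedy independent-set heuristic, not the Leaf heuristic itself: proteins are nodes, an edge joins any pair judged too similar, and the returned set contains no two similar proteins. The toy graph is invented.

```python
# A generic minimum-degree greedy heuristic for an independent set in a
# protein-similarity graph (not the Leaf heuristic from the thesis).
# Nodes are proteins; an edge means "too similar"; the returned set is
# pairwise non-redundant. The toy graph below is invented.
def greedy_independent_set(adjacency):
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    independent = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))   # remaining vertex of minimum degree
        independent.append(v)
        removed = adj.pop(v) | {v}                # drop v and all of its neighbours
        for u in removed - {v}:
            adj.pop(u, None)
        for nbrs in adj.values():
            nbrs -= removed                       # clean dangling edges
    return independent

similarity_edges = {
    "P1": {"P2", "P3"}, "P2": {"P1"}, "P3": {"P1", "P4"},
    "P4": {"P3"}, "P5": set(),
}
print(sorted(greedy_independent_set(similarity_edges)))  # -> ['P2', 'P3', 'P5']
```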
|
730 |
Data analysis of salmonid environmental DNA measurements obtained via controlled experiments and from several Pacific streams. Sneiderman, Robert, 13 January 2021 (has links)
Standard sampling and monitoring of fish populations are invasive and time-consuming techniques. The ongoing development of statistical techniques to analyze environmental DNA (eDNA) introduces a possible solution to these challenges. We analyzed and created statistical models for qPCR data obtained from two controlled experiments that were conducted on samples of Coho salmon at the Goldstream Hatchery.
The first experiment analyzed was a density experiment whereby varying numbers of Coho (1, 2, 4, 8, 16, 32 and 65 fish) were placed in separate tanks and eDNA measurements were taken. The second experiment dealt with dilution, whereby three Coho were placed into tanks, removed and eDNA was then sampled at dilution volumes of 20kL, 40kL, 80kL, 160kL and 1000kL.
Finally, we analyzed a set of field data from several streams in the Pacific Northwest for the presence of Coho salmon. In the field models, we considered the impact of environmental covariates as well as eDNA concentrations.
Our analysis suggests that eDNA concentration can be used as a reliable proxy to estimate Coho biomass. / Graduate / 2021-11-20
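As a rough illustration of the proxy relationship suggested above (a simple log-log regression with invented measurements, not the models fitted in the thesis), the sketch below relates qPCR-derived eDNA concentration to the number of fish per tank from the density design.

```python
# A toy log-log regression of eDNA concentration on fish density, echoing the
# density experiment design (1-65 fish per tank). All measurements are invented.
import numpy as np
from scipy import stats

fish = np.array([1, 2, 4, 8, 16, 32, 65])                  # fish per tank
edna = np.array([0.8, 1.9, 3.5, 8.2, 14.9, 33.1, 61.0])    # invented qPCR copies/uL

slope, intercept, r, p, se = stats.linregress(np.log(fish), np.log(edna))
print(f"log-log slope {slope:.2f}, R^2 {r**2:.2f}")
# A slope near 1 with high R^2 would support eDNA concentration as a proxy
# for biomass; the thesis fits richer models that include covariates.
```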
|