Global ETD Search

221	A systems approach to computational protein identification Ramakrishnan, Smriti Rajan 21 October 2010 Proteomics is the science of understanding the dynamic protein content of an organism's cells (its proteome), which is one of the largest current challenges in biology. Computational proteomics is an active research area that involves in-silico methods for the analysis of high-throughput protein identification data. Current methods are based on a technology called tandem mass spectrometry (MS/MS) and suffer from low coverage and accuracy, reliably identifying only 20-40% of the proteome. This dissertation addresses recall, precision, speed and scalability of computational proteomics experiments. This research goes beyond the traditional paradigm of analyzing MS/MS experiments in isolation, instead learning priors of protein presence from the joint analysis of various systems biology data sources. This integrative 'systems' approach to protein identification is very effective, as demonstrated by two new methods. The first, MSNet, introduces a social model for protein identification and leverages functional dependencies from genome-scale, probabilistic, gene functional networks. The second, MSPresso, learns a gene expression prior from a joint analysis of mRNA and proteomics experiments on similar samples. These two sources of prior information result in more accurate estimates of protein presence, and increase protein recall by as much as 30% in complex samples, while also increasing precision. A comprehensive suite of benchmarking datasets is introduced for evaluation in yeast. Methods to assess statistical significance in the absence of ground truth are also introduced and employed whenever applicable. This dissertation also describes a database indexing solution to improve speed and scalability of protein identification experiments.
The method, MSFound, customizes a metric-space database index and its associated approximate k-nearest-neighbor search algorithm with a semi-metric distance designed to match noisy spectra. MSFound achieves an order of magnitude speedup over traditional spectra database searches while maintaining scalability. / text Computational biology Bioinformatics Integrative statistical data analysis Computational proteomics Systems biology Database indexing
222	Lagrangian Relaxation - Solving NP-hard Problems in Computational Biology via Combinatorial Optimization Canzar, Stefan 14 November 2008 This thesis is devoted to two $\mathcal{NP}$-complete combinatorial optimization problems arising in computational biology, the well-studied \emph{multiple sequence alignment} problem and the newly formulated \emph{interval constraint coloring} problem. It shows that advanced mathematical programming techniques are capable of solving large-scale real-world instances from biology to optimality. Furthermore, it reveals alternative methods that provide approximate solutions. In the first part of the thesis, we present a \emph{Lagrangian relaxation} approach for the multiple sequence alignment (MSA) problem. The multiple alignment is one common mathematical abstraction of the comparison of multiple biological sequences, like DNA, RNA, or protein sequences. If the weight of a multiple alignment is measured by the sum of the projected pairwise weights of all pairs of sequences in the alignment, then finding a multiple alignment of maximum weight is $\mathcal{NP}$-complete if the number of sequences is not fixed. The majority of the available tools for aligning multiple sequences implement heuristic algorithms; no current exact method is able to solve moderately large instances or instances involving sequences exhibiting a lower degree of similarity. We present a branch-and-bound (B\&B) algorithm for the MSA problem. We approximate the optimal integer solution in the nodes of the B\&B tree by a Lagrangian relaxation of an ILP formulation for MSA relative to an exponentially large class of inequalities that ensure that all pairwise alignments can be incorporated into a multiple alignment.
By lifting these constraints prior to dualization the Lagrangian subproblem becomes an \emph{extended pairwise alignment} (EPA) problem: Compute the longest path in an acyclic graph, that is penalized a charge for entering ``obstacles''. We describe an efficient algorithm that solves the EPA problem repetitively to determine near-optimal \emph{Lagrangian multipliers} via subgradient optimization. The reformulation of the dualized constraints with respect to additionally introduced variables improves the convergence rate dramatically. We account for the exponential number of dualized constraints by starting with an empty \emph{constraint pool} in the first iteration to which we add cuts in each iteration, that are most violated by the convex combination of a small number of preceding Lagrangian solutions (including the current solution). In this \emph{relax-and-cut} scheme, only inequalities from the constraint pool are dualized. The interval constraint coloring problem appears in the interpretation of experimental data in biochemistry. Monitoring hydrogen-deuterium exchange rates via mass spectroscopy is a method used to obtain information about protein tertiary structure. The output of these experiments provides aggregate data about the exchange rate of residues in overlapping fragments of the protein backbone. These fragments must be re-assembled in order to obtain a global picture of the protein structure. The interval constraint coloring problem is the mathematical abstraction of this re-assembly process. The objective of the interval constraint coloring problem is to assign a color (exchange rate) to a set of integers (protein residues) such that a set of constraints is satisfied. Each constraint is made up of a closed interval (protein fragment) and requirements on the number of elements in the interval that belong to each color class (exchange rates observed in the experiments). 
We introduce a polyhedral description of the interval constraint coloring problem, which serves as a basis to attack the problem by integer linear programming (ILP) methods and tools, which perform well in practice. Since the goal is to provide biochemists with all possible candidate solutions, we combine related solutions to equivalence classes in an improved ILP formulation in order to reduce the running time of our enumeration algorithm. Moreover, we establish the polynomial-time solvability of the two-color case by the integrality of the linear programming relaxation polytope $\mathcal{P}$, and also present a combinatorial polynomial-time algorithm for this case. We apply this algorithm as a subroutine to approximate solutions to instances with arbitrary but fixed number of colors and achieve an order of magnitude improvement in running time over the (exact) ILP approach. We show that the problem is $\mathcal{NP}$-complete for arbitrary number of colors, and we provide algorithms that, given an instance with $\mathcal{P}\neq\emptyset$, find a coloring that satisfies all the coloring requirements within $\pm 1$ of the prescribed value. In light of our $\mathcal{NP}$-completeness result, this is essentially the best one can hope for. Our approach is based on polyhedral theory and randomized rounding techniques. In practice, data emanating from the experiments are noisy, which normally causes the instance to be infeasible, and, in some cases, even forces $\mathcal{P}$ to be empty. To deal with this problem, the objective of the ILP is to minimize the total sum of absolute deviations from the coloring requirements over all intervals. The combinatorial approach for the two-color case optimizes the same objective function. Furthermore, we use this combinatorial method to compute, in a Lagrangian way, a bound on the minimum total error, which is exploited in a branch-and-bound manner to determine all optimal colorings. 
Alternatively, we study a variant of the problem in which we want to maximize the number of requirements that are satisfied. We prove that this variant is $\mathcal{APX}$-hard even in the two-color case and thus does not admit a polynomial-time approximation scheme (PTAS) unless $\mathcal{P}=\mathcal{NP}$. Therefore, we slightly (by a factor of $(1+\epsilon)$) relax the condition on when a requirement is satisfied and propose a \emph{quasi-polynomial time approximation scheme} (QPTAS) which finds a coloring that ``satisfies'' the requirements of as many intervals as possible. Computer Science/Other Lagrangian Relaxation Computational Biology Multiple Sequence Alignment Interval Constrained Coloring
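The interval constraint coloring problem described in the abstract of entry 222 can be illustrated on a toy instance. The sketch below is a brute-force enumeration written for this listing, not the ILP or Lagrangian method of the thesis. The function names, the color labels "fast" and "slow", and the example constraints are all assumptions made for illustration.

```python
from itertools import product

def satisfies(coloring, constraints):
    """Check each (start, end, required) constraint on a coloring.

    Positions are 1-indexed and intervals are closed, as in the abstract.
    `required` maps a color to the exact number of positions in the
    interval that must carry that color.
    """
    for start, end, required in constraints:
        segment = coloring[start - 1:end]
        for color, count in required.items():
            if segment.count(color) != count:
                return False
    return True

def enumerate_colorings(n, colors, constraints):
    """Return every coloring of positions 1..n that meets all constraints.

    Exhaustive search over len(colors)**n candidates, so only usable
    for very small n.
    """
    return [c for c in product(colors, repeat=n) if satisfies(c, constraints)]

# Hypothetical instance: 4 residues, two overlapping fragments.
constraints = [
    (1, 3, {"fast": 2, "slow": 1}),  # fragment covering residues 1..3
    (2, 4, {"fast": 1, "slow": 2}),  # fragment covering residues 2..4
]
solutions = enumerate_colorings(4, ("fast", "slow"), constraints)
for s in solutions:
    print(s)
```

Enumerating all candidate solutions mirrors the stated goal of giving biochemists every feasible coloring, but the exhaustive search is exponential in the number of residues, which is why the thesis turns to ILP and combinatorial algorithms.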
223	Protein and Drug Design Algorithms Using Improved Biophysical Modeling Hallen, Mark Andrew January 2016 This thesis focuses on the development of algorithms that will allow protein design calculations to incorporate more realistic modeling assumptions. Protein design algorithms search large sequence spaces for protein sequences that are biologically and medically useful. Better modeling could improve the chance of success in designs and expand the range of problems to which these algorithms are applied. I have developed algorithms to improve modeling of backbone flexibility (DEEPer) and of more extensive continuous flexibility in general (EPIC and LUTE). I’ve also developed algorithms to perform multistate designs, which account for effects like specificity, with provable guarantees of accuracy (COMETS), and to accommodate a wider range of energy functions in design (EPIC and LUTE). / Dissertation Computer science Biochemistry Algorithms Bioinformatics Combinatorial optimization Computational biology Drug design Protein design
224	From Sequence to Structure : Using predicted residue contacts to facilitate template-free protein structure prediction Michel, Mirco January 2017 Despite the fundamental role of experimental protein structure determination, computational methods are of essential importance to bridge the ever-growing gap between available protein sequence and structure data. Common structure prediction methods rely on experimental data, which is not available for about half of the known protein families. Recent advancements in amino acid contact prediction have revolutionized the field of protein structure prediction. Contacts can be used to guide template-free structure predictions that do not rely on experimentally solved structures of homologous proteins. Such methods are now able to produce accurate models for a wide range of protein families. We developed PconsC2, an approach that improved existing contact prediction methods by recognizing intra-molecular contact patterns and reducing noise. An inherent problem of contact prediction based on maximum entropy models is that large alignments with over 1000 effective sequences are needed to infer contacts accurately. These are, however, not available for more than 80% of all protein families that do not have a representative structure in PDB. With PconsC3, we could extend the applicability of contact prediction to families as small as 100 effective sequences by combining global inference methods with machine learning based on local pairwise measures. By introducing PconsFold, a pipeline for contact-based structure prediction, we could show that improvements in contact prediction accuracy translate to more accurate models. Finally, we applied a similar technique to Pfam, a comprehensive database of known protein families. In addition to using a faster folding protocol, we employed model quality assessment methods, crucial for estimating the confidence in the accuracy of predicted models.
We propose models to be accurate for 558 families that do not have a representative known structure. Out of those, over 75% have not been reported before. / At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 2: Submitted. Paper 4: In press. protein bioinformatics protein structure prediction contact prediction machine learning Bioinformatics (Computational Biology) Bioinformatik (beräkningsbiologi)
225	Computational discovery of DNA methylation patterns as biomarkers of ageing, cancer, and mental disorders : Algorithms and Tools Torabi Moghadam, Behrooz January 2017 Epigenetics refers to the mitotically heritable modifications in gene expression without a change in the genetic code. A combination of molecular, chemical and environmental factors constituting the epigenome is involved, together with the genome, in setting up the unique functionality of each cell type. DNA methylation is the most studied epigenetic mark in mammals, where a methyl group is added to the cytosine in a cytosine-phosphate-guanine dinucleotide, or CpG site. It has been shown to have a major role in various biological phenomena such as chromosome X inactivation, regulation of gene expression, cell differentiation, and genomic imprinting. Furthermore, aberrant patterns of DNA methylation have been observed in various diseases including cancer. In this thesis, we have utilized machine learning methods and developed new methods and tools to analyze DNA methylation patterns as a biomarker of ageing, cancer subtyping and mental disorders. In Paper I, we introduced a pipeline of Monte Carlo Feature Selection and rule-based modeling using ROSETTA in order to identify combinations of CpG sites that classify samples in different age intervals based on the DNA methylation levels. The combinations of genes that appeared to act together motivated us to develop an interactive pathway browser, named PiiL, to check the methylation status of multiple genes in a pathway. The tool enhances detecting differential patterns of DNA methylation and/or gene expression by quickly assessing large data sets. In Paper III, we developed a novel unsupervised clustering method, methylSaguaro, for analyzing various types of cancers, to detect cancer subtypes based on their DNA methylation patterns.
Using this method we confirmed the previously reported findings that challenge the histological grouping of the patients, and proposed new subtypes based on DNA methylation patterns. In Paper IV, we investigated the DNA methylation patterns in a cohort of schizophrenic and healthy samples, using all the methods that were introduced and developed in the first three papers. DNA methylation machine learning biomarker cancer ageing classification Bioinformatics (Computational Biology) Bioinformatik (beräkningsbiologi)
226	IDENTIFICATION OF NOVEL SLEEP RELATED GENES FROM LARGE SCALE PHENOTYPING EXPERIMENTS IN MICE Joshi, Shreyas 01 January 2017 Humans spend a third of their lives sleeping but very little is known about the physiological and genetic mechanisms controlling sleep. Increased data from sleep phenotyping studies in mouse and other species, genetic crosses, and gene expression databases can all help improve our understanding of the process. Here, we present analysis of our own sleep data from the large-scale phenotyping program at The Jackson Laboratory (JAX), to identify the best gene candidates and phenotype predictors for influencing sleep traits. The original knockout mouse project (KOMP) was a worldwide collaborative effort to produce embryonic stem (ES) cell lines with one of the mouse’s 21,000 protein coding genes knocked out. The objective of KOMP2 is to phenotype as many of these lines as feasible, with each mouse studied over a ten-week period (www.mousephenotype.org). The phenotyping for sleep behavior is done using our non-invasive Piezo system for mouse activity monitoring. Thus far, sleep behavior has been recorded in more than 6000 mice representing 343 knockout lines and nearly 2000 control mice. Control and KO mice have been compared using multivariate statistical approaches to identify genes that exhibit significant effects on sleep variables from Piezo data. Using these statistical approaches, significant genes affecting sleep have been identified. Genes affecting sleep in a specific sex and that specifically affect sleep during daytime and/or night have also been identified and reported. KOMP2 includes a broad-based phenotyping pipeline that collects physiological and biochemical parameters through a variety of assays. Mice enter the pipeline at 4 weeks of age and leave at 18 weeks. Currently, the IMPC (International Mouse Phenotyping Consortium) database contains more than 33 million observations.
Our final dataset, prepared by extracting data for the biological samples for which sleep recordings are available, consists of nearly 1.5 million observations from a multitude of phenotyping assays. Through big data analytics and sophisticated machine learning approaches, we have been able to identify predictor phenotypes that affect sleep in mice. The phenotypes thus identified can play a key role in developing our understanding of the mechanism of sleep regulation. Sleep Bioinformatics Gene-Phenotype Association KOMP2 Predictive Modeling Complex Traits Behavioral Neurobiology Bioinformatics Computational Biology Genetics
227	Stochastic models of ion channel dynamics and their role in short-term repolarisation variability in cardiac cells Dangerfield, C. E. January 2012 Sudden cardiac death due to the development of lethal arrhythmias is the dominant cause of mortality in the UK, yet the mechanisms underlying their onset, maintenance and termination are still poorly understood. Therefore biomarkers are used to determine arrhythmic risk within patients and of new drug compounds. In recent years, the magnitude of variations in the length of successive beats, measured over a short period of time, has been shown to be a powerful predictor of arrhythmic risk. This beat-to-beat variability is thought to be the manifestation of the random opening and closing dynamics of individual ion channels that lie within the membrane of cardiac cells. Computational models have become an important tool in understanding the electrophysiology of the heart. However, current state-of-the-art electrophysiology models do not incorporate this intrinsic stochastic behaviour of ion channels. Those that do either use computationally costly methods, restricting their use in complex tissue scale simulations, or employ stochastic simulation methods that result in negative numbers of channels and so are inaccurate. Therefore, using current stochastic modelling techniques to investigate the role of stochastic ion channel behaviour in beat-to-beat variability presents difficulties. In this thesis we take a mathematically rigorous and novel approach to develop accurate and computationally efficient models of stochastic ion channel dynamics that can be incorporated into existing electrophysiology models. Two different models of stochastic ion channel behaviour, both based on a system of stochastic differential equations (SDEs), are developed and compared. The first model is based on an existing SDE model from population dynamics called the Wright-Fisher model.
The second approach incorporates boundary conditions into the SDE model of ion channel dynamics that is obtained in the limit from the discrete-state Markov chain model, and is called a reflected SDE. Of these two methods, the reflected SDE is found to more accurately capture the stochastic dynamics of the discrete-state Markov chain, seen as the ‘gold-standard’ model, and also provides a substantial computational speed-up. Thus the reflected SDE is an accurate and efficient model of stochastic ion channel dynamics and so allows for detailed investigation into beat-to-beat variability using complex computational electrophysiology models. We illustrate the potential power of this method by incorporating it into a state-of-the-art canine cardiac cell electrophysiology model so as to explore the effects of stochastic ion channel behaviour on beat-to-beat variability. The stochastic models presented in this thesis fulfil an important role in elucidating the effects of stochastic ion channel behaviour on beat-to-beat variability, a potentially important biomarker of arrhythmic risk. 616.128
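To make the idea of a reflected SDE in entry 227 more concrete, here is a minimal sketch for a two-state (closed/open) channel population. It assumes the standard Langevin approximation for the open fraction x, namely dx = (alpha(1 - x) - beta x) dt + sqrt((alpha(1 - x) + beta x) / N) dW, and uses a simple Euler-Maruyama step followed by mirroring at the boundaries. The rate constants, the function names, and the mirroring scheme are illustrative assumptions and are not taken from the thesis.

```python
import math
import random

def reflect(x):
    """Fold x back into [0, 1] by mirroring at the boundaries."""
    if x < 0.0:
        x = -x
    if x > 1.0:
        x = 2.0 - x
    # Guard against very large excursions that one mirroring cannot fix.
    return min(1.0, max(0.0, x))

def simulate_open_fraction(alpha, beta, n_channels, x0, dt, n_steps, seed=0):
    """Euler-Maruyama path of the open fraction with reflection at 0 and 1.

    alpha: opening rate, beta: closing rate (hypothetical constants).
    The reflection keeps the state physically meaningful, so the
    square root below never sees a negative argument.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        drift = alpha * (1.0 - x) - beta * x
        diffusion = math.sqrt((alpha * (1.0 - x) + beta * x) / n_channels)
        x = x + drift * dt + diffusion * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = reflect(x)
        path.append(x)
    return path

path = simulate_open_fraction(alpha=1.0, beta=2.0, n_channels=50,
                              x0=0.3, dt=0.01, n_steps=1000)
print(min(path), max(path))
```

Without the `reflect` step, an unconstrained Euler-Maruyama path can leave [0, 1], which corresponds to the negative channel numbers the abstract mentions as a source of inaccuracy.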
228	Metabolic pathway analysis via integer linear programming Planes, Francisco J. January 2008 The understanding of cellular metabolism has been an intriguing challenge in classical cellular biology for decades. Essentially, cellular metabolism can be viewed as a complex system of enzyme-catalysed biochemical reactions that produces the energy and material necessary for the maintenance of life. In modern biochemistry, it is well known that these reactions group into metabolic pathways so as to accomplish a particular function in the cell. The identification of these metabolic pathways is a key step to fully understanding the metabolic capabilities of a given organism. Typically, metabolic pathways have been elucidated via experimentation on different organisms. However, experimental findings are generally limited and fail to provide a complete description of all pathways. For this reason it is important to have mathematical models that allow us to identify and analyze metabolic pathways in a computational fashion. This is precisely the main theme of this thesis. We firstly describe, review and discuss existing mathematical/computational approaches to metabolic pathways, namely stoichiometric and path finding approaches. Then, we present our initial mathematical model named the Beasley-Planes (BP) model, which significantly improves on previous stoichiometric approaches. We also illustrate a successful application of the BP model to optimally disrupt metabolic pathways. The main drawback of the BP model is that it needs extra pathway knowledge as input. This is especially inappropriate if we wish to detect unknown metabolic pathways. As opposed to the BP model and stoichiometric approaches, this issue is not found in path finding approaches. For this reason a novel path finding approach is built and examined in detail. This analysis serves us as inspiration to build the Improved Beasley-Planes (IBP) model.
The IBP model incorporates elements of both stoichiometric and path finding approaches. Though somewhat less accurate than the BP model, the IBP model solves the issue of extra pathway knowledge. Our research clearly demonstrates that there is a significant chance of developing a mathematical optimisation model that underlies many/all metabolic pathways. 572.80285
229	MYSTERIES OF THE TRYPANOSOMATID MAXICIRCLES: CHARACTERIZATION OF THE MAXICIRCLE GENOMES AND THE EVOLUTION OF RNA EDITING IN THE ORDER KINETOPLASTIDA Iyengar, Preethi Ranganathan 01 January 2015 The trypanosomatid protists belonging to Order Kinetoplastida are some of the most successful parasites ever known to mankind. Their extreme physiological diversity and adaptability to different environmental conditions and host systems make them some of the most widespread parasites, causing deadly diseases in humans and other vertebrates. This project focuses on their unique mitochondrion, called the kinetoplast, and more specifically involves the characterization of a part of their mitochondrial DNA (also called kinetoplast DNA or kDNA), the maxicircles, which are functional homologs of eukaryotic mitochondrial DNA in the kinetoplastid protists. We have sequenced and characterized the maxicircle genomes of 20 new trypanosomatids and compared them with 8 previously published maxicircle genomes of other trypanosomatids. Transcripts of ~13 of the 20 total genes in these maxicircles undergo post-transcriptional modifications involving the insertion and deletion of U residues at precise sites, to yield the final, fully-edited, translatable mRNA. We have deciphered the diverse patterns and extents of RNA editing of each edited gene in the maxicircle of each organism, and inferred the sequences of the putative fully edited mitochondrial transcripts and proteins. Using a binary-value-based strategy (1/0), we quantified the RNA editing in all these trypanosomatids and estimated the evolution of RNA editing in the group. Additionally, we conducted phylogenetic analyses using a subset of unedited maxicircle genes to predict the relationships between the various trypanosomatids in this project, and compared them to the previously published nuclear gene-based phylogenies.
For convenience of analysis, the 28 total trypanosomatids in this work were divided into two groups: the first group consisting of the endosymbiont-bearing and related insect trypanosomatids, which constitute the first half of the project, and the second group consisting of trypanosomatids of the Trypanosoma genus, including T. cruzi-related and unrelated parasites, constituting the latter half of the project. In summary, most of the trypanosomatid maxicircles showed a syntenic panel of 20 protein-coding genes (excluding any guide RNA genes), beginning with the mitochondrial ribosomal genes and ending with the gene encoding NADH dehydrogenase-5. Although some genes were partially or completely absent in the maxicircles of some species, the remaining genes were completely syntenic. The total number of genes edited and their editing patterns varied considerably among the first group of insect trypanosomatids, but were remarkably similar in the second group of the Trypanosoma genus. On a broad scale, the mitochondrial phylogeny reflects the nuclear phylogeny for these trypanosomatids, except within the T. cruzi population. Similarly, RNA editing appears to have evolved in parallel with the nuclear genes, although subtle differences are again noticeable within the T. cruzi family. maxicircle trypanosomatid kinetoplast endosymbiont RNA editing evolution Bioinformatics Biology Computational Biology Genetics and Genomics Microbiology Molecular Biology
230	SYSTEM GENETIC ANALYSIS OF MECHANISMS UNDERLYING EXCESSIVE ALCOHOL CONSUMPTION Smith, Maren L. 01 January 2016 Increased alcohol consumption over time is one of the characteristic symptoms of Alcohol Use Disorder (AUD). The molecular mechanisms underlying this escalation in intake are still the subject of study. However, the mesocortical and mesolimbic dopamine pathways, and the extended amygdala, because of their involvement in reward and reinforcement, are believed to play key roles in these behavioral changes. Multiple gene expression studies have shown that alcohol affects the expression of thousands of genes in the brain. The studies discussed in this document use the systems biology technique of co-expression network analysis to attempt to find patterns within genome-wide expression data from two animal models of chronic, high-dose ethanol exposure. These analyses have identified time-dependent and brain-region-specific patterns of expression in C57BL/6J mice after multiple exposures to intoxicating doses of ethanol and withdrawal. Specifically, they have identified the PFC and HPC as showing long-term ethanol regulation, and identified Let-7 family miRNAs as potential gene expression regulators of chronic ethanol response. Network analysis also indicates neurotransmitter release and neuroimmune response are strongly correlated with ethanol intake in chronically exposed mice. Examining gene expression response to chronic ethanol exposure across a variable genetic background revealed that, although gene expression response may show conserved patterns, underlying differences in gene expression influenced by genetic background may be what truly underlies voluntary ethanol consumption.
Finally, combined network analysis of gene expression in the prefrontal cortex (PFC) of mice and macaques following prolonged ethanol exposure demonstrated that neurotransmission, myelination, transcription, cellular respiration, and, possibly, neurovasculature are affected by chronic ethanol across species. Taken together, these studies generate several new hypotheses and areas for future research in the continued study of druggable targets for AUD. alcohol network analysis WGCNA mouse CIE microarray Bioinformatics Computational Biology Genetics Genomics
