1 |
Applications and extensions of Random Forests in genetic and environmental studies
Michaelson, Jacob, 10 January 2011
Transcriptional regulation refers to the molecular systems that control the concentration of mRNA species within the cell. Variation in these controlling systems is not only responsible for many diseases, but also contributes to the vast phenotypic diversity in the biological world. There are powerful experimental approaches to probe these regulatory systems, and the focus of my doctoral research has been to develop and apply effective computational methods that exploit these rich data sets more completely. First, I present a method for mapping genetic regulators of gene expression (expression quantitative trait loci, or eQTL) using Random Forests. This approach allows for flexible modeling and feature selection, and results in eQTL that are more biologically supportable than those mapped with competing methods. Next, I present a method that finds interactions between genes that in turn regulate the expression of other genes. This is accomplished by finding recurring decision motifs in the forest structure that represent dependencies between genetic loci. Third, I present a method to use distributional differences in eQTL data to establish the regulatory roles of genes relative to other disease-associated genes. Using this method, we found that genes that are master regulators of other disease genes are more likely to be consistently associated with the disease in genetic association studies. Finally, I present a novel application of Random Forests to determine the mode of regulation of toxin-perturbed genes, using time-resolved gene expression. The results demonstrate a novel approach to supervised weighted clustering of gene expression data.
|
2 |
Functional characterization of proteins involved in cell cycle by structure-based computational methods
Sontheimer, Jana, 14 May 2012
In recent years, a rapidly increasing amount of experimental data has been generated by high-throughput technologies. Despite these large quantities of protein-related data and the development of computational prediction methods, the function of many proteins is still unknown. In the human proteome, at least 20% of the annotated proteins are not characterized. Thus, the question of how to predict a protein's function from its amino acid sequence remains unanswered for many proteins. Classical bioinformatics approaches for function prediction are based on inferring function from well-characterized homologs, which are identified based on sequence similarity. However, these methods fail to identify distant homologs with low sequence similarity. As protein structure is more conserved than sequence in protein families, structure-based methods (e.g. fold recognition) may recognize possible structural similarities even at low sequence similarity and therefore provide information for function inference. These fold recognition methods have already proven successful for individual proteins, but their automation for high-throughput application is difficult due to intrinsic challenges of these techniques, mainly caused by a high false positive rate. Automated identification of remote homologs based on fold recognition methods would allow a significant improvement in the functional annotation of proteins. My approach was to combine structure-based computational prediction methods with experimental data from genome-wide RNAi screens to support the establishment of functional hypotheses by improving the analysis of protein structure prediction results.
In the first part of my thesis, I characterized proteins from the Ska complex by computational methods. I showed the benefit of including experimental information to identify remote homologs: integration of functional data helped to reduce the number of false positives in fold recognition results and made it possible to establish interesting functional hypotheses based on high-confidence structural predictions. Based on the structural hypothesis of a GLEBS motif in c13orf3 (Ska3), I could derive a potential molecular mechanism that could explain the observed phenotype.
In the second part of my thesis, my goal was to develop computational tools and automated analysis techniques to be able to perform structure-based functional annotation in a high-throughput way. I designed and implemented key tools that were successfully integrated into a computational platform, called StrAnno, which I set up together with my colleagues. These novel computational modules include a domain prediction algorithm and a graphical overview that facilitates and accelerates the analysis of results.
StrAnno can be seen as a first step towards automatic functional annotation of proteins by structure-based methods. First, the analysis of long hit lists to identify promising candidates for further analysis is substantially facilitated by integration and combination of various sequence-based computational tools and data from functional databases. Second, the developed post-processing tools accelerate the evaluation of structural and functional hypotheses. False positives from the threading result lists are removed by various filters, and analysis of the possible true positives is greatly enhanced by the graphical overview. With these two essential benefits, fold recognition techniques become applicable to large-scale approaches. By applying this methodology to hits from a genome-wide cell cycle RNAi screen and evaluating structural hypotheses by molecular modeling techniques, I aimed to associate biological functions with human proteins and link the RNAi phenotype to a molecular function. For two selected human proteins, c20orf43 and HJURP, I could establish interesting structural and functional hypotheses. These predictions were based on templates with low sequence identity (10-20%). The uncharacterized human protein c20orf43 might be an E3 SUMO ligase that could be involved either in DNA repair or in rRNA regulatory processes. Based on the structural hypotheses for two domains of HJURP, I predicted a potential link to ubiquitylation processes and direct DNA binding. In addition, I substantiated the cell cycle arrest phenotype of these two genes upon RNAi knockdown.
Fold recognition methods are a promising alternative for functional annotation of proteins that escape sequence-based annotation due to their low sequence identity to well-characterized protein families. The structural and functional hypotheses I established in my thesis open the door to investigate the molecular mechanisms of previously uncharacterized proteins, which may provide new insights into cellular mechanisms.
|
3 |
Novel concepts for lipid identification from shotgun mass spectra using a customized query language
Herzog, Ronny, 23 August 2012
Lipids are the main component of semipermeable cell membranes and are linked to several important physiological processes. Shotgun lipidomics relies on the direct infusion of total lipid extracts from cells, tissues or organisms into the mass spectrometer and is a powerful tool to elucidate their molecular composition. Despite the technical advances in modern mass spectrometry, the currently available software underperforms in several aspects of the lipidomics pipeline. This thesis addresses these issues by presenting a new concept for lipid identification that uses a customized query language for mass spectra in combination with efficient spectra alignment algorithms, both implemented in the open source kit “LipidXplorer”.
|
4 |
Protein interactions in disease: Using structural protein interactions and regulatory networks to predict disease-relevant mechanisms
Winter, Christof Alexander, 17 January 2012
Proteins and their interactions are fundamental to cellular life. Disruption of protein-protein, protein-RNA, or protein-DNA interactions can lead to disease, by affecting the function of protein complexes or by affecting gene regulation. A better understanding of these interactions on the molecular level gives rise to new methods to predict protein interaction, and is critical for the rational design of new therapeutic agents that disrupt disease-causing interactions. This thesis consists of three parts that focus on various aspects of protein interactions and their prediction in the context of disease.
In the first part of this thesis, we classify interfaces of protein-protein interactions. We do so by systematically computing all binding sites between protein domains in protein complex structures solved by X-ray crystallography. The result is SCOPPI, the Structural Classification of Protein Protein Interfaces. Clustering and classification of geometrically similar interfaces reveals interesting examples comprising viral mimicry of human interface binding sites, gene fusion events, conservation of interface residues, and diversity of interface localisations. We then develop a novel method to predict protein interactions which is based on these structural interface templates from SCOPPI. The method is applied in three use cases covering osteoclast differentiation, which is relevant for osteoporosis, the microtubule-associated network in meiosis, and proteins found deregulated in pancreatic cancer. As a result, we are able to reconstruct many interactions known to the expert molecular biologist, and predict novel high confidence interactions backed up by structural or experimental evidence. These predictions can facilitate the generation of hypotheses, and provide knowledge on binding sites of promising disease-relevant candidates for targeted drug development.
In the second part, we present a novel algorithm to search for protein binding sites in RNA sequences. The algorithm combines RNA structure prediction with sequence motif scanning and evolutionary conservation to identify binding sites on candidate messenger RNAs. It is used to search for binding sites of the PTBP1 protein, an important regulator of insulin secretion in the pancreatic beta cell. First, applied to a benchmark set of mRNAs known to be regulated by PTBP1, the algorithm successfully finds significant binding sites in all benchmark mRNAs. Second, collaborators carried out a screen to identify changes in the proteome of beta cells upon glucose stimulation while inhibiting gene expression. When this set of post-transcriptionally controlled candidate mRNAs was analysed for PTBP1 binding, the algorithm produced a ranked list of 11 high-confidence potential PTBP1 binding sites. Experimental validation of predicted targets is ongoing. Overall, identifying targets of PTBP1, and hence regulators of insulin secretion, may contribute to the treatment of diabetes by providing novel protein drug targets or by aiding in the design of novel RNA-binding therapeutics.
The third part of this thesis deals with gene regulation in disease. One of the great challenges in medicine is to correlate genotypic data, such as gene expression measurements, and other covariates, such as age or gender, to a variety of phenotypic data from the patient. Here, we address the problem of survival prediction based on microarray data in cancer patients. To this end, a computational approach was devised to find genes in human cancer tissue samples whose expression is predictive of the patient's survival outcome. The central idea of the approach is the incorporation of background knowledge in the form of a network, and the use of an algorithm similar to Google's PageRank. Applied to pancreatic cancer, it identifies a set of eight genes that makes it possible to predict whether a patient has a poor or good prognosis. The approach shows an accuracy comparable to studies that were performed in breast cancer or lymphatic malignancies. Yet no such study had been done for pancreatic cancer. Regulatory networks contain information on transcription factors that bind to DNA in order to regulate genes. We find that including background knowledge in the form of such regulatory networks yields the greatest improvement in prediction accuracy compared to including protein interaction or co-expression networks. Currently, our collaborators are testing the eight identified genes for their power to predict survival in an independent group of 150 patients. From a therapeutic perspective, reliable survival prediction greatly improves the choice of therapy. Whereas the life expectancy of some patients might benefit from extensive therapy such as surgery and chemotherapy, for other patients this may only be a burden. Instead, for this group, a less aggressive or different treatment could result in better quality of the remaining lifetime.
In conclusion, this thesis contributes novel analytical tools that provide insight into disease-relevant interactions of proteins. Furthermore, it contributes a novel algorithm to deal with noisy microarray measurements, which considerably improves the prediction of survival of cancer patients from gene expression data.
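The idea of ranking genes with a PageRank-like walk over a background network can be illustrated with a small sketch. This is not the algorithm from the thesis: the function name `network_rank`, the toy network, the gene labels, the per-gene relevance scores (standing in for, say, the strength of association between expression and survival) and the damping factor are all assumptions made for illustration.

```python
def network_rank(edges, relevance, damping=0.85, iterations=100):
    """Personalized PageRank on an undirected network (illustrative sketch).

    edges:     (gene_a, gene_b) pairs of a hypothetical background network
    relevance: gene -> non-negative score; at least one score must be > 0
    """
    edges = list(edges)
    genes = set(relevance)
    for a, b in edges:
        genes.update((a, b))
    genes = sorted(genes)

    neighbours = {g: set() for g in genes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Restart distribution: proportional to each gene's own relevance score.
    total = sum(relevance.get(g, 0.0) for g in genes)
    restart = {g: relevance.get(g, 0.0) / total for g in genes}

    rank = dict(restart)
    for _ in range(iterations):
        # Rank held by genes without neighbours is redistributed via restart.
        dangling = sum(rank[g] for g in genes if not neighbours[g])
        rank = {
            g: (1 - damping) * restart[g]
            + damping * (
                sum(rank[n] / len(neighbours[n]) for n in neighbours[g])
                + dangling * restart[g]
            )
            for g in genes
        }
    return rank


# Toy example: a regulator "TF" with no score of its own, linked to three
# scored target genes.
toy_edges = [("TF", "A"), ("TF", "B"), ("TF", "C")]
toy_scores = {"A": 1.0, "B": 1.0, "C": 1.0, "TF": 0.0}
for gene, score in sorted(network_rank(toy_edges, toy_scores).items()):
    print(gene, round(score, 3))
```

In this toy network the regulator receives rank from its scored neighbours even though its own score is zero, which is the kind of effect that makes network context useful when individual measurements are noisy.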
|
5 |
Integration and analysis of phenotypic data from functional screens
Paszkowski-Rogacz, Maciej, 10 January 2011
Motivation: Although various high-throughput technologies provide a lot of valuable information, each of them gives insight into different aspects of cellular activity and each has its own limitations. Thus, a complete and systematic understanding of the cellular machinery can be achieved only by a combined analysis of results coming from different approaches. However, methods and tools for integration and analysis of heterogeneous biological data still have to be developed.
Results: This work presents a systemic analysis of basic cellular processes, i.e. cell viability and cell cycle, as well as embryonic stem cell pluripotency and differentiation. These phenomena were studied using several high-throughput technologies, whose combined results were analysed with existing and novel clustering and hit selection algorithms.
This thesis also introduces two novel data management and data analysis tools. The first, called DSViewer, is a database application designed for integrating and querying results coming from various genome-wide experiments. The second, named PhenoFam, is an application performing gene set enrichment analysis by employing structural and functional information on families of protein domains as annotation terms. Both programs are accessible through a web interface.
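Enrichment analysis with protein domain families as annotation terms can be sketched with a simple over-representation test. The abstract does not say which statistic PhenoFam uses, so the hypergeometric test below is a stand-in, and the function names, gene identifiers and domain labels are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, hits, annotated_hits):
    """P(X >= annotated_hits) when `hits` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the domain."""
    upper = min(annotated, hits)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, hits - k)
        for k in range(annotated_hits, upper + 1)
    )
    return tail / comb(population, hits)


def domain_enrichment(hit_genes, gene_to_domains):
    """Return domain -> p-value for over-representation among the hits.

    gene_to_domains maps every screened gene to its set of domain families
    (a hypothetical annotation table); it defines the background population.
    """
    population = len(gene_to_domains)
    hits = set(hit_genes) & set(gene_to_domains)
    domains = {d for ds in gene_to_domains.values() for d in ds}
    pvalues = {}
    for domain in domains:
        annotated = {g for g, ds in gene_to_domains.items() if domain in ds}
        overlap = len(annotated & hits)
        pvalues[domain] = hypergeom_pvalue(
            population, len(annotated), len(hits), overlap
        )
    return pvalues
```

With many domain families tested at once, the resulting p-values would still need a multiple-testing correction before any family is called enriched.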
Conclusions: Ultimately, the investigations presented in this work provide the research community with a novel and markedly improved repertoire of computational tools and methods that help turn the accumulated information from high-throughput studies into novel biological insights.
|
6 |
Semi-automated Ontology Generation for Biocuration and Semantic Search
Wächter, Thomas, 01 February 2011
Background:
In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed.
Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing.
Motivation:
The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences.
Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods.
Results:
The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation yields high-quality terms that are also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, and child-ancestor relations for up to 54%. No other validated system exists that achieves comparable results.
To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not, with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because they contained a specific term or because of the automatic classification. The Go3R search engine is available online at www.Go3R.org.
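Pattern-based extraction of parent-child relations can be illustrated with a few lexical templates applied to a list of candidate terms. The abstract does not list the patterns DOG4DAG uses, so the three templates below are assumptions in the style of common "is a" and "such as" patterns, and the function name and example sentences are hypothetical.

```python
import re
from itertools import permutations

# Hypothetical lexical templates; {child} and {parent} are filled with
# escaped candidate terms before matching.
PATTERNS = [
    "{child} is an? (?:type of |kind of )?{parent}",
    "{parent}s? such as {child}",
    "{child} and other {parent}s?",
]


def find_parent_child(sentences, terms):
    """Return the set of (child, parent) pairs supported by some sentence."""
    relations = set()
    for child, parent in permutations(terms, 2):
        for template in PATTERNS:
            body = template.format(
                child=re.escape(child), parent=re.escape(parent)
            )
            regex = re.compile(r"\b" + body + r"\b", re.IGNORECASE)
            if any(regex.search(sentence) for sentence in sentences):
                relations.add((child, parent))
                break
    return relations


sentences = [
    "An apoptosis assay is a type of cell viability assay.",
    "Enzymes such as kinase transfer chemical groups.",
]
terms = ["apoptosis assay", "cell viability assay", "kinase", "enzyme"]
print(sorted(find_parent_child(sentences, terms)))
```

Because surface patterns like these also match misleading sentences, extracted relations are best treated as suggestions for a curator to confirm, which fits the semi-automatic workflow the abstract describes.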
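Expanding a query with the structure and terminology of an ontology can be sketched as collecting a term's descendants and their synonyms. The toy ontology, the synonym table and the function name `expand_query` are assumptions for illustration and do not reflect the actual Go3R ontology or implementation.

```python
def expand_query(term, children, synonyms):
    """Return the query term together with all descendant concepts and
    the synonyms of every collected concept.

    children: concept -> list of direct child concepts (hypothetical DAG)
    synonyms: concept -> list of alternative labels (hypothetical)
    """
    concepts = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in concepts:
            continue  # a concept in a DAG can be reached via several parents
        concepts.add(current)
        stack.extend(children.get(current, ()))

    expanded = set(concepts)
    for concept in concepts:
        expanded.update(synonyms.get(concept, ()))
    return expanded


toy_children = {
    "in vitro method": ["cell culture", "organ-on-chip"],
    "cell culture": ["3D cell culture"],
}
toy_synonyms = {"cell culture": ["cell cultivation"]}
print(sorted(expand_query("in vitro method", toy_children, toy_synonyms)))
```

A search engine would then match documents against any of the expanded labels, so a query for a broad concept also retrieves documents that mention only its more specific variants.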
|