1 |
Gene Ontology-based framework to annotate genes of hearing
Ovezmyradov, Guvanchmyrat, 23 October 2012
No description available.
|
2 |
Analyse bioinformatique du génome et de l’épigénome du pommier / Bioinformatic analysis of the apple genome and epigenome
Daccord, Nicolas, 27 November 2018
Apple is one of the most consumed fruits in the world. Using the latest sequencing (PacBio) and optical mapping (BioNano) technologies, we generated a high-quality de novo assembly of the apple (Malus domestica Borkh.) genome. We annotated genes and transposable elements so that this assembly can be used as a reference genome. The high contiguity of the assembly allowed exhaustive detection of the transposable elements, which represented over half the assembly, thus providing an unprecedented opportunity to investigate the uncharacterized regions of a tree genome. We also found that the apple genome is entirely duplicated, as shown by the synteny links between chromosomes. Using Whole Genome Bisulfite Sequencing (WGBS) and the previously generated assembly, we produced genome-wide DNA methylation maps and showed a general correlation between DNA methylation near genes and gene expression. Moreover, we identified several Differentially Methylated Regions (DMRs) between apple fruit and leaf methylomes, associated with candidate genes that could be involved in agronomically relevant traits such as fruit development. Finally, we developed a complete and easy-to-use pipeline that handles the full treatment of WGBS data, from read mapping to DMR computation. It can handle datasets with a low number of biological replicates.
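The last step of such a pipeline, comparing methylation between two tissues, can be illustrated with a minimal sketch. This is not the pipeline from the thesis: the `Window` record, the fixed windows, the thresholds `min_diff` and `min_cov`, and all read counts are assumptions made up for illustration, and real DMR callers use statistical tests rather than a plain difference cut-off.

```python
from dataclasses import dataclass


@dataclass
class Window:
    chrom: str
    start: int
    meth_reads: int   # reads supporting methylated cytosines in the window
    total_reads: int  # all reads covering cytosines in the window


def methylation_level(w):
    """Fraction of methylated reads in a window."""
    return w.meth_reads / w.total_reads


def candidate_dmrs(sample_a, sample_b, min_diff=0.3, min_cov=10):
    """Return (chrom, start, diff) for windows whose levels differ by >= min_diff.

    Windows covered by fewer than min_cov reads in either sample are skipped.
    """
    b_by_pos = {(w.chrom, w.start): w for w in sample_b}
    hits = []
    for wa in sample_a:
        wb = b_by_pos.get((wa.chrom, wa.start))
        if wb is None or wa.total_reads < min_cov or wb.total_reads < min_cov:
            continue
        diff = methylation_level(wa) - methylation_level(wb)
        if abs(diff) >= min_diff:
            hits.append((wa.chrom, wa.start, round(diff, 3)))
    return hits


# Hypothetical counts for three 100 bp windows in two tissues.
fruit = [Window("chr1", 0, 45, 50), Window("chr1", 100, 10, 40), Window("chr1", 200, 3, 5)]
leaf = [Window("chr1", 0, 20, 50), Window("chr1", 100, 12, 40), Window("chr1", 200, 0, 5)]

print(candidate_dmrs(fruit, leaf))
```

Only the first window is reported here: the second differs too little, and the third is dropped for low coverage even though its raw difference is large.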
|
3 |
Automatic Assignment of Protein Function with Supervised Classifiers
Jung, Jae, 16 January 2010
High-throughput genome sequencing and sequence analysis technologies have
created the need for automated annotation and analysis of large sets of genes. The
Gene Ontology (GO) provides a common controlled vocabulary for describing gene
function. However, the process for annotating proteins with GO terms is usually
through a tedious manual curation process by trained professional annotators. With
the wealth of genomic data that are now available, there is a need for accurate automated
annotation methods.
The overall objective of my research is to improve our ability to automatically annotate
proteins with GO terms. The first method, Automatic Annotation of Protein
Functional Class (AAPFC), employs protein functional domains as features and learns
independent Support Vector Machine classifiers for each GO term. This approach relies only on protein functional domains as features, and demonstrates that statistical
pattern recognition can outperform expert curated mapping of protein functional
domain features to protein functions. The second method, Predict of Gene Ontology
(PoGO), is a meta-classification method that integrates multiple heterogeneous
data sources. This method performs better than the protein domain
method alone.
Apart from these two methods, several systems have been developed that employ pattern recognition to assign gene function using a variety of features, such as the sequence similarity, presence of protein functional domains and gene expression
patterns. Most of these approaches have not considered the hierarchical relationships
among the terms in the form of a directed acyclic graph (DAG). The DAG represents
the functional relationships between the GO terms, thus it should be an important
component of an automated annotation system. I describe a Bayesian network used as
a multi-layered classifier that incorporates the relationships among GO terms found in
the GO DAG. I also describe an inference algorithm for quickly assigning GO terms
to unlabeled proteins. A comparison of the method with previously
described annotation systems shows that it provides improved annotation
accuracy when the performance on individual GO terms is compared. More importantly,
this method enables the assignment of significantly more GO terms to more
proteins than was previously possible.
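Why the DAG matters for automated annotation can be shown with a small sketch. It is not the Bayesian network described in the abstract; it is a much simpler substitute that post-processes independent per-term scores so that they respect the GO true-path rule (a protein annotated with a term is implicitly annotated with all of that term's ancestors). The three GO identifiers are real molecular-function terms, but the scores and the helper names `ancestors` and `make_consistent` are made up for illustration.

```python
# Toy fragment of the GO DAG: child term -> parent terms (is_a edges).
PARENTS = {
    "GO:0005515": ["GO:0005488"],  # protein binding -> binding
    "GO:0005488": ["GO:0003674"],  # binding -> molecular_function
    "GO:0003674": [],              # molecular_function (root)
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def make_consistent(scores, parents=PARENTS):
    """Raise each ancestor's score to at least the score of its descendants."""
    fixed = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            fixed[anc] = max(fixed.get(anc, 0.0), score)
    return fixed


# Hypothetical outputs of three independent per-term classifiers for one protein.
raw = {"GO:0005515": 0.8, "GO:0005488": 0.3, "GO:0003674": 0.9}
print(make_consistent(raw))
```

In the raw scores, the protein is more likely to bind proteins (0.8) than to bind anything at all (0.3), which the hierarchy rules out. The correction lifts the parent term to 0.8. A classifier that models the DAG directly avoids producing such contradictions in the first place.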
|
4 |
Genome-wide analysis of mutually exclusive splicing
Hatje, Klas, 29 January 2013
No description available.
|
5 |
Single Amplified Genomes as Source for Novel Extremozymes: Annotation, Expression and Functional Assessment
Grötzinger, Stefan 12 1900
Enzymes, as nature’s catalysts, show remarkable abilities that can revolutionize the chemical, biotechnological, bioremediation, agricultural and pharmaceutical industries. However, the narrow range of stability of the majority of described biocatalysts limits their use for many applications. To overcome these restrictions, extremozymes derived from microorganisms thriving under harsh conditions can be used. Extremophiles living in high salinity are especially interesting as they operate at low water activity, which is similar to conditions used in standard chemical applications. Because only about 0.1 % of all microorganisms can be cultured, the traditional culture-based approach to determining enzyme function needs to be overcome. The rise of high-throughput next-generation sequencing technologies allows for deep insight into nature’s variety. Single amplified genomes (SAGs) specifically allow for whole genome assemblies from small sample volumes with low cell yields, as are typical for extreme environments. Although these technologies have been available for years, the expected boost in biotechnology has not yet materialized. One of the main reasons is the lack of reliable functional annotation of the genomic data, which is caused by the small fraction (0.15 %) of experimentally described genes. Here, we present a novel annotation algorithm, designed to annotate the enzymatic function of genomes from microorganisms with low homology to described microorganisms. The algorithm was established on SAGs from the extreme environment of selected hypersaline Red Sea brine pools with 4.3 M salinity and temperatures up to 68°C. Additionally, a novel consensus pattern for the identification of γ-carbonic anhydrases was created and applied in the algorithm. To verify the annotation, selected genes were expressed in the hypersaline expression system Halobacterium salinarum. This expression system was established and optimized in a continuously stirred tank reactor, leading to substantially increased cell amounts and protein yields. The resulting gene expression products were assessed for function in vivo and/or in vitro. Our functional evaluation of the tested genes confirmed our annotation algorithm. The strategy we developed offers a general guide for using SAGs as a source for scientific and industrial investigations into “microbial dark matter” and may help to develop new catalysts applicable to novel reactions in green chemistry.
|
6 |
High quality gene annotation for deep phylogenetic analysis
Indrischek, Henrike, 27 August 2018
Gene prediction in newly sequenced genomes is a known challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple very similar paralogs are present.
The issue is aggravated further when genomes are assembled only at a preliminary draft level to contigs or short scaffolds rather than to chromosomes. However, these genomes deliver valuable information for studying gene families. High accuracy models of protein-coding genes are needed in particular for phylogenetics and for the analysis of gene family histories.
In this dissertation, I established a tool, the ExonMatchSolver-pipeline (EMS-pipeline), that can assist the assembly of genes distributed across multiple fragments (e.g. contigs). The tool in particular tackles the problem of identifying those coding exon groups that belong to the same paralogous genes in a fragmented genome assembly. The EMS-pipeline accommodates a homology search step with a protein input set consisting of several highly similar paralogs as query. The core of the pipeline uses an Integer Linear Programming Implementation to solve the paralog-to-contig assignment problem. An extension to the initial implementation estimates the number of paralogs encoded in the target genome and can handle several paralogs that are situated on the same genomic fragment.
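The paralog-to-contig assignment problem can be made concrete with a toy sketch. It does not reproduce the EMS-pipeline or its Integer Linear Programming formulation: exhaustive search stands in for the ILP solver, and the constraint used here (each contig goes to at most one paralog, and no exon of a paralog may be supplied by two contigs) is an assumption about the problem's general shape. ARRB1 and ARRB2 are real arrestin gene names, but the contigs, exon hits and scores are made up.

```python
from itertools import product

# HITS[contig][paralog] = {exon_number: alignment_score}; all values are made up.
HITS = {
    "contig1": {"ARRB1": {1: 90, 2: 85}, "ARRB2": {1: 60, 2: 55}},
    "contig2": {"ARRB1": {3: 88}, "ARRB2": {3: 70}},
    "contig3": {"ARRB1": {1: 58, 2: 50}, "ARRB2": {1: 92, 2: 89}},
}


def best_assignment(hits):
    """Try every contig -> paralog mapping and keep the highest-scoring valid one."""
    contigs = sorted(hits)
    paralogs = sorted({p for per_contig in hits.values() for p in per_contig})
    best_score, best = -1, None
    # None means the contig is left unassigned.
    for choice in product(paralogs + [None], repeat=len(contigs)):
        used = set()  # (paralog, exon) pairs already claimed by some contig
        score, valid = 0, True
        for contig, paralog in zip(contigs, choice):
            if paralog is None:
                continue
            for exon, exon_score in hits[contig].get(paralog, {}).items():
                if (paralog, exon) in used:
                    valid = False  # the same exon would be assembled twice
                    break
                used.add((paralog, exon))
                score += exon_score
            if not valid:
                break
        if valid and score > best_score:
            best_score, best = score, dict(zip(contigs, choice))
    return best, best_score


assignment, total = best_assignment(HITS)
print(assignment, total)
```

Exhaustive search grows exponentially with the number of contigs, which is why an optimization formulation such as ILP is the practical choice for real assemblies. The toy case still shows the key effect: contig2 is assigned to the paralog whose exons 1 and 2 sit on a different contig, completing a gene model that is split across fragments.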
The EMS-pipeline was successfully applied to simulated data, several showcase examples and to deuterostome genomes in a large scale study on the evolution of the arrestin protein family. Especially at high genome fragmentation levels, the tool outperformed a naive assignment method.
Arrestins are key signaling transducers that bind to activated and phosphorylated G protein-coupled receptors and can mediate their endocytosis into the cell. The refined annotations of arrestins resulting from the application of the EMS-pipeline are more complete and accurate than those from a conventional database search strategy. With the applied strategy it was possible to map the duplication and deletion history of arrestin paralogs in detail, including tandem duplications, pseudogenizations and the formation of retrogenes.
My results support the emergence of the four arrestin paralogs from a visual and a non-visual proto-arrestin. Surprisingly, the visual ARR3 was lost in the mammalian clades afrotherians and xenarthrans. Segmental duplications in specific clades and the 3R-WGD in the teleost stem lineage, on the other hand, must have given rise to new paralogs that show signatures of diversification in functional elements important for receptor binding and phosphate sensing. The four vertebrate orthology groups show an interesting pattern of divergence of three endocytosis motifs: the minor and major clathrin binding site and the adapter protein-2 (AP-2) binding motif.
Identification of such signatures, and of residues that determine specificity between paralogs and are positively selected after duplication, was made possible by high-quality alignments obtained through genome inquiries, dense species sampling and consideration of fragmented loci from poorly assembled genomes in the framework of the EMS-pipeline established in this dissertation.
1 Introduction
1.1 Basics and definitions
1.1.1 What is a gene?
1.1.2 What is a tree in phylogenetics?
1.1.3 What are paralogs and orthologs?
1.1.4 Central dogma in molecular biology: From DNA to protein
1.2 Gene duplications as evolutionary playground
1.2.1 Mechanisms of gene duplication
1.2.2 Evolutionary fate of duplicated genes
1.3 Identification and annotation of protein homologs
1.3.1 Challenges of existing resources
1.3.2 Similarity search approaches without consideration of the gene structure
1.3.3 Gene structure aware gene annotation approaches
1.3.4 Graph-based inference of orthology relationships
1.3.5 Chance and challenge of fragmented assemblies
1.4 Applied phylogenetic methods
1.4.1 Phylogenetic inference in a nutshell
1.4.2 Inference of natural selection in inter-species data sets
1.4.3 Detection of specificity determining positions
1.5 Multi-talents in cell signaling: The cytosolic arrestin proteins
1.5.1 Functions of arrestins in cell signaling
1.5.2 Arrestin activation by GPCR binding
1.5.3 Functions of arrestins in cellular trafficking
1.5.4 Evolution of arrestins
2 The ExonMatchSolver-pipeline
2.1 Motivation
2.2 Methods
2.2.1 Pipeline overview
2.2.2 Exon assembly as an assignment problem
2.2.3 Solving the Paralog-to-Contig Assignment Problem
2.2.4 Post-processing
2.2.5 Implementation and usage
2.2.6 Performance assessment by simulations
2.3 Results
2.3.1 Performance on simulated data
2.3.2 Performance on real data - Two Showcase Examples
2.4 Discussion
3 Evolution of the arrestin protein family in deuterostomes
3.1 Motivation
3.2 Material and Methods
3.2.1 Database scan
3.2.2 Detailed gene annotation
3.2.3 Data resources used in the current study
3.2.4 Alignment and building of phylogenetic trees
3.2.5 Identification of specificity determining positions
3.2.6 Testing for natural selection
3.2.7 Assessment of conservation
3.2.8 Parsimonious reconstruction of exon gain and loss events
3.3 Results
3.3.1 Evolution of the arrestin fold family based on database inquiries
3.3.2 The refined arrestin annotations are more complete than database entries
3.3.3 Arrestin paralog gain and loss patterns based on the refined annotations
3.3.4 Evolution of arrestin functional elements
3.4 Discussion
3.4.1 Limitation of arrestin database annotations
3.4.2 Arrestins in early vertebrate evolution
3.4.3 Sub- and neofunctionalization as consequence of the 3R-WGD
3.4.4 Independent arrestin duplications in deuterostomes
3.4.5 Loss of arrestin paralogs in different vertebrate orders
3.4.6 Previously unknown interaction partners and isoforms
4 Improvements on the ExonMatchSolver-pipeline
4.1 Motivation
4.2 Methods
4.2.1 Estimation of the paralog number
4.2.2 Subdivision of gene loci on the same contig
4.2.3 Implementation details
4.2.4 Assessment of the ExonMatchSolver-pipeline Version 2
4.3 Results
4.4 Discussion
5 Conclusion and Outlook
A Additional figures
B Additional tables
C CV
Bibliography
|
7 |
Role of cytochromes P450 in wine aroma / Rôle des cytochromes P450 dans l’arôme du vin
Ilc, Tina, 18 December 2015
A thorough annotation of the P450 superfamily in grapevine revealed its genomic organization, phylogeny and expression. Specifically, we identified genes whose expression is activated in the ripe grape berry, the stage at which many aroma compounds are synthesized. Among the known oxygenated monoterpenols in grapevine, wine lactone has the lowest odor detection threshold and therefore the largest potential impact on wine aroma. We demonstrated that wine lactone is formed during wine ageing via a slow non-enzymatic reaction from the precursor (E)-8-carboxylinalool. We showed that the accumulation of this precursor in grape berries parallels the expression of several cytochrome P450 genes, among which CYP76F14 has the highest expression. While three of these enzymes catalyzed some of the oxidative steps from linalool to (E)-8-carboxylinalool, only CYP76F14 efficiently catalyzed the whole pathway. Taken together, the catalytic activity and expression pattern of CYP76F14 indicate that it is a prime candidate for the formation of the wine lactone precursor in grape berries.
|
8 |
Genetic Background and Biomarkers of Atrial Fibrillation Progression and Recurrence
Büttner, Petra, 28 August 2019
Atrial fibrillation (AF) is a progressive disease that manifests morphologically as fibrotic remodeling of the atrium and is characterized clinically by a transition from paroxysmal to persistent AF, an enlargement of the left atrial diameter, and the intra-procedural detection of atrial low-voltage areas. AF progression is associated with poorer therapeutic outcomes, e.g. a higher AF recurrence rate. The underlying pathomechanisms are incompletely characterized, and suitable biomarkers for individual therapeutic stratification are not known.
Most common genetic variants associated with AF progression have only very small effects. Many variants together, however, could additively modulate the progression of AF. Following this hypothesis, variants associated with AF progression were identified, and their non-random accumulation in gene loci was taken as evidence of the contextual relevance of the respective gene. The identified genes were assigned to calcium signaling and extracellular matrix (ECM)-receptor interaction. In addition, the central regulators of these pathways, namely EGFR, RYR2, PRKCA, FN1 and LAMA1, were identified; they are candidate pharmacological targets or need to be investigated with regard to their role in AF progression. Using a comparable approach, it was shown that the genes ZFHX3, ITGA9 and SOX5, which are associated with the manifestation of AF, are also associated with AF progression.
An analysis of potential biomarkers identified NT-proANP as a specific marker that correlates directly with the degree of AF progression. In addition, the markers NT-proANP, NT-proBNP and VCAM1 showed a stepwise significant increase correlating with a clinical score for predicting AF recurrence.
By applying unconventional concepts and using specific characteristics of AF progression, this work identified potential regulators and biomarkers of AF progression.
1. Introduction
1.1. Atrial fibrillation incidence and associated risk
1.2. Atrial fibrillation therapy and recurrence
1.3. Atrial fibrillation progression phenotypes
1.3.1. Atrial fibrillation type
1.3.2. Left atrial diameter
1.3.3. Low voltage areas
1.3.4. PR interval
1.4. Pathomechanisms of atrial fibrillation progression
1.4.1. Electrical remodeling
1.4.2. Structural remodeling and fibrosis
1.4.3. Autonomic remodeling
1.5. Genetic background of atrial fibrillation
1.5.1. Heritable atrial fibrillation and the impact of rare variants
1.5.2. Genetic predisposition and the impact of common variants
1.5.3. New concepts for GWAS analysis of genetic background
1.6. Personalized medicine
1.6.1. Clinical scores for risk prediction in atrial fibrillation
1.6.2. Clinical scores for prediction of AF progression and recurrence
1.6.3. New concepts in personalized medicine
1.7.1. Schematic overview on AF progression and associated open questions
2. Hypotheses
3. Publications
3.1. New concepts on genetic background of AF progression and recurrence
3.1.1. Genomic contributors to atrial electroanatomical remodeling and atrial fibrillation progression: Pathway enrichment analysis of GWAS data. (Publication 1)
3.1.2. Genomic Contributors to Rhythm Outcome of Atrial Fibrillation Catheter Ablation - Pathway Enrichment Analysis of GWAS Data. (Publication 2)
3.1.3. Identification of Central Regulators of Calcium Signaling and ECM-Receptor Interaction Genetically Associated With the Progression and Recurrence of Atrial Fibrillation. (Publication 3)
3.1.4. Association of atrial fibrillation susceptibility genes, atrial fibrillation phenotypes and response to catheter ablation: a gene-based analysis of GWAS data. (Publication 4)
3.1.5. PR Interval Associated Genes, Atrial Remodeling and Rhythm Outcome of Catheter Ablation of Atrial Fibrillation - A Gene-Based Analysis of GWAS Data. (Publication 5)
3.2. New concepts on biomarkers of AF progression and recurrence
3.2.1. Role of NT-proANP and NT-proBNP in patients with atrial fibrillation: Association with atrial fibrillation progression phenotypes. (Publication 6)
3.2.2. Prediction of electro-anatomical substrate using APPLE score and biomarkers. (Publication 7)
4. Conclusions
5. References
6. Abbreviations
7. Declarations on the submitted habilitation thesis
8. Curriculum vitae
9. Acknowledgements
|
9 |
Unsupervised and semi-supervised training methods for eukaryotic gene prediction
Ter-Hovhannisyan, Vardges, 17 November 2008
This thesis describes new gene finding methods for eukaryotic gene prediction. The current methods for deriving model parameters for gene prediction algorithms are based on curated or experimentally validated sets of genes or gene elements. Building these training sets often requires time and additional expert effort, especially for species that are in the initial stages of genome sequencing. Unsupervised training allows determination of model parameters from anonymous genomic sequence. The practical applicability of unsupervised training is critical given the ever-growing rate of eukaryotic genome sequencing.
Three distinct training procedures are developed for a diverse group of eukaryotic species. GeneMark-ES is developed for species with strong donor and acceptor site signals, such as Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. The second version of the algorithm, GeneMark-ES-2, introduces an enhanced intron model to better describe the gene structure of fungal species, which possess relatively weak donor and acceptor splice sites and a well-conserved branch point signal. GeneMark-LE, a semi-supervised training approach, is designed for eukaryotic species with a small number of introns.
The results indicate that the developed unsupervised training methods perform well compared to other training methods, as estimated from the set of genes supported by EST-to-genome alignments.
Analysis of novel genomes reveals interesting biological findings and shows that the current set of annotated fungal genomes contains several candidate under-annotated and over-annotated species.
|
10 |
Machine learning methods for genomic high-content screen data analysis applied to deduce organization of endocytic network
Nikitina, Kseniia, 13 July 2023
High-content screens are widely used to gain insight into the mechanistic organization of biological systems. Chemical and/or genomic interference is used to modulate the molecular machinery; light microscopy and quantitative image analysis then yield a large number of parameters describing the phenotype. However, extracting functional information from such high-content datasets (e.g. links between cellular processes or functions of unknown genes) remains challenging. This work is devoted to the analysis of a multi-parametric image-based genomic screen of endocytosis, the process whereby cells take up cargoes (signals and nutrients) and distribute them into different subcellular compartments. The complexity of the quantitative endocytic data was approached using different machine learning techniques, namely clustering methods, Bayesian networks, principal and independent component analysis, and artificial neural networks. The main goal of such an analysis is to predict possible modes of action of screened genes and to find candidate genes that may be involved in a process of interest. The number of degrees of freedom of the multidimensional phenotypic space was identified using the data distributions, and the high-content data were then deconvolved into separate signals from different cellular modules. Some of those basic signals (phenotypic traits) were straightforward to interpret in terms of known molecular processes; the other components gave insight into interesting directions for further research. The phenotypic profiles of perturbations of individual genes are sparse in the coordinates of the basic signals and therefore intrinsically suggest the genes' functional roles in cellular processes. Because endocytosis is a very fundamental process, it is specifically modulated by a variety of different pathways in the cell; endocytic phenotyping can therefore be used to analyze non-endocytic modules in the cell. The proposed approach can also be generalized to the analysis of other high-content screens.
Contents
Objectives
Chapter 1 Introduction
1.1 High-content biological data
1.1.1 Different perturbation types for HCS
1.1.2 Types of observations in HTS
1.1.3 Goals and outcomes of MP HTS
1.1.4 An overview of the classical methods of analysis of biological HT- and HCS data
1.2 Machine learning for systems biology
1.2.1 Feature selection
1.2.2 Unsupervised learning
1.2.3 Supervised learning
1.2.4 Artificial neural networks
1.3 Endocytosis as a system process
1.3.1 Endocytic compartments and main players
1.3.2 Relation to other cellular processes
Chapter 2 Experimental and analytical techniques
2.1 Experimental methods
2.1.1 RNA interference
2.1.2 Quantitative multiparametric image analysis
2.2 Detailed description of the endocytic HCS dataset
2.2.1 Basic properties of the endocytic dataset
2.2.2 Control subset of genes
2.3 Machine learning methods
2.3.1 Latent variables models
2.3.2 Clustering
2.3.3 Bayesian networks
2.3.4 Neural networks
Chapter 3 Results
3.1 Selection of labeled data for training and validation based on KEGG information about genes pathways
3.2 Clustering of genes
3.2.1 Comparison of clustering techniques on control dataset
3.2.2 Clustering results
3.3 Independent components as basic phenotypes
3.3.1 Algorithm for identification of the best number of independent components
3.3.2 Application of ICA on the full dataset and on separate assays of the screen
3.3.3 Gene annotation based on revealed phenotypes
3.3.4 Searching for genes with target function
3.4 Bayesian network on endocytic parameters
3.4.1 Prediction of pathway based on parameters values using Naïve Bayesian Classifier
3.4.2 General Bayesian Networks
3.5 Neural networks
3.5.1 Autoencoders as nonlinear ICA
3.5.2 siRNA sequence motif discovery with deep NN
3.6 Biological results
3.6.1 Rab11 ZNF-specific phenotype found by ICA
3.6.2 Structure of BN revealed dependency between endocytosis and cell adhesion
Chapter 4 Discussion
4.1 Machine learning approaches for discovery of phenotypic patterns
4.1.1 Functional annotation of unknown genes based on phenotypic profiles
4.1.2 Candidate genes search
4.2 Adaptation to other HCS data and generalization
Chapter 5 Outlook and future perspectives
5.1 Handling sequence-dependent off-target effects with neural networks
5.2 Transition between machine learning and systems biology models
Acknowledgements
References
Appendix
A.1 Full list of cellular and endocytic parameters
A.2 Description of independent components of the full dataset
A.3 Description of independent components extracted from separate assays of the HCS
|