101 |
A robust algorithm for segmenting fluorescence images and its application to single-molecule counting
Boisvert, Jacques 12 1900
La microscopie par fluorescence de cellules vivantes produit de grandes quantités de
données. Ces données sont composées d’une grande diversité au niveau de la forme des
objets d’intérêts et possèdent un ratio signaux/bruit très bas. Pour concevoir un pipeline
d’algorithmes efficaces en traitement d’image de microscopie par fluorescence, il
est important d’avoir une segmentation robuste et fiable étant donné que celle-ci constitue
l’étape initiale du traitement d’image. Dans ce mémoire, je présente MinSeg, un
algorithme de segmentation d’image de microscopie par fluorescence qui fait peu d’assomptions
sur l’image et utilise des propriétés statistiques pour distinguer le signal par
rapport au bruit. MinSeg ne fait pas d’assomption sur la taille ou la forme des objets
contenus dans l’image. Par ce fait, il est donc applicable sur une grande variété d’images.
Je présente aussi une suite d’algorithmes pour la quantification de petits complexes dans
des expériences de microscopie par fluorescence de molécules simples utilisant l’algorithme
de segmentation MinSeg. Cette suite d’algorithmes a été utilisée pour la quantification
d’une protéine nommée CENP-A qui est une variante de l’histone H3. Par cette
technique, nous avons trouvé que CENP-A est principalement présente sous forme de
dimère. / Live-cell fluorescence microscopy produces large amounts of data with highly variable
shapes at low signal-to-noise ratios. Designing efficient image analysis
pipelines requires a reliable and robust initial segmentation step that needs little parameter
fine-tuning. Here, I present MinSeg, a segmentation algorithm for fluorescence
image data that relies on minimal assumptions about the image and uses statistical considerations
to distinguish signal from background. Importantly, the algorithm makes
no assumptions about feature size or shape and is thus applicable to a wide variety of images. I
also present a pipeline for quantifying small complexes in single-molecule
fluorescence microscopy experiments that uses this segmentation algorithm as the first step of the workflow.
This pipeline was used to quantify a small histone H3 variant protein
called CENP-A. We found that the CENP-A nucleosomes are dimers.
|
102 |
Développement d’une méthode bio-informatique permettant de relier les gènes aux métabolites [Development of a bioinformatics method for linking genes to metabolites]
Cherkaoui, Sarah 12 1900
L’objectif de ce projet était de faire le lien entre gènes et métabolites afin d’éventuellement proposer des métabolites à mesurer en lien avec la fonction de gènes. Plus particulièrement, nous nous sommes intéressés aux gènes codant pour des protéines ayant un impact sur le métabolisme, soit les enzymes qui catalysent les réactions faisant partie intégrante des voies métaboliques. Afin de quantifier ce lien, nous avons développé une méthode bio-informatique permettant de calculer la distance qui est définie comme le nombre de réactions entre l’enzyme encodée par le gène et le métabolite dans la carte globale du métabolisme de la base de données Kyoto Encyclopedia of Genes and Genomes (KEGG). Notre hypothèse était que les métabolites d’intérêt sont des substrats/produits se trouvant à proximité des réactions catalysées par l’enzyme encodée par le gène. Afin de tester cette hypothèse et de valider la méthode, nous avons utilisé les études d’association pangénomique combinées à la métabolomique (mGWAS) car elles rapportent des associations entre variants génétiques, annotés en gènes, et métabolites mesurés. Plus précisément, la méthode a été appliquée à l’étude mGWAS par Shin et al. Bien que la couverture des associations de Shin et al. était limitée (24/299), nous avons pu valider de façon significative la proximité entre gènes et métabolites associés (P<0,01). En somme, cette méthode et ses développements futurs permettront d’interpréter de façon quantitative les associations mGWAS, de prédire quels métabolites mesurer en lien avec la fonction d’un gène et, plus généralement, de permettre une meilleure compréhension du contrôle génétique sur le métabolisme. / The objective of this project was to link genes and metabolites so as to ultimately predict which metabolites should be measured to adequately reflect the function of a given gene. 
Specifically, we were interested in genes that code for proteins regulating substrate metabolism, namely enzymes that catalyze reactions that are part of metabolic pathways. To quantify this link, we developed a bioinformatics method to calculate a distance, defined as the number of reactions separating a given gene-encoded enzyme and its metabolite of interest in the metabolic overview map of the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Our hypothesis was that metabolites of interest are products/substrates found in proximity to the reactions catalyzed by the selected gene-encoded enzyme. To test our hypothesis and validate the method, we used genome-wide association studies of metabolite levels (mGWAS) because these studies report associations between genetic variants, annotated to genes, and measured metabolites. More specifically, we used the mGWAS conducted by Shin et al. Even though the coverage of the associations reported by Shin et al. was limited (24/299), we found that associated gene-metabolite pairs were significantly close to each other (P<0.01). Overall, the method and its future developments will allow the quantitative interpretation of mGWAS associations, the prediction of which metabolites to measure with regard to the function of a gene and, in general, a better understanding of the genetic control of metabolism.
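The distance described in this abstract can be illustrated with a breadth-first search over reactions that share a metabolite. This is only a minimal sketch under stated assumptions: the toy network, the gene annotations, the function name `gene_metabolite_distance`, and the convention that direct substrates/products sit at distance 0 are all hypothetical and are not taken from the thesis or from KEGG.

```python
from collections import deque

# Hypothetical toy network: reaction id -> metabolites taking part in it.
REACTIONS = {
    "R1": {"glucose", "g6p"},
    "R2": {"g6p", "f6p"},
    "R3": {"f6p", "fbp"},
    "R4": {"fbp", "dhap", "g3p"},
}
# Hypothetical annotation: gene -> reactions catalyzed by its enzyme.
GENE_REACTIONS = {"HK1": {"R1"}, "PFKM": {"R3"}}


def gene_metabolite_distance(gene, metabolite,
                             reactions=REACTIONS,
                             gene_reactions=GENE_REACTIONS):
    """Count reaction steps from a gene's reactions to a metabolite.

    Assumed convention: a metabolite taking part in a reaction catalyzed
    by the gene's enzyme is at distance 0; each additional reaction that
    has to be crossed adds 1. Returns None if unreachable.
    """
    start = gene_reactions.get(gene, set())
    queue = deque((reaction, 0) for reaction in sorted(start))
    seen = set(start)
    while queue:
        reaction, dist = queue.popleft()
        if metabolite in reactions[reaction]:
            return dist
        # Two reactions are neighbours when they share a metabolite.
        for other, members in reactions.items():
            if other not in seen and members & reactions[reaction]:
                seen.add(other)
                queue.append((other, dist + 1))
    return None


if __name__ == "__main__":
    for gene, metabolite in [("HK1", "glucose"), ("HK1", "dhap"), ("PFKM", "glucose")]:
        print(gene, metabolite, gene_metabolite_distance(gene, metabolite))
```

Because breadth-first search visits reactions in order of increasing step count, the first reaction found to contain the metabolite gives the shortest distance under this convention.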
|
103 |
Identification de caractéristiques communes et rares dans les ARN structurés dans la base de données Rfam [Identification of common and rare features in structured RNAs in the Rfam database]
El Korbi, Amell 08 1900
Les ARN non codants (ARNnc) sont des transcrits d'ARN qui ne sont pas traduits en protéines et qui pourtant ont des fonctions clés et variées dans la cellule telles que la régulation des gènes, la transcription et la traduction. Parmi les nombreuses catégories d'ARNnc qui ont été découvertes, on trouve des ARN bien connus tels que les ARN ribosomiques (ARNr), les ARN de transfert (ARNt), les snoARN et les microARN (miARN). Les fonctions des ARNnc sont étroitement liées à leurs structures d’où l’importance de développer des outils de prédiction de structure et des méthodes de recherche de nouveaux ARNnc. Les progrès technologiques ont mis à la disposition des chercheurs des informations abondantes sur les séquences d'ARN. Ces informations sont accessibles dans des bases de données telles que Rfam, qui fournit des alignements et des informations structurelles sur de nombreuses familles d'ARNnc. Dans ce travail, nous avons récupéré toutes les séquences des structures secondaires annotées dans Rfam, telles que les boucles en épingle à cheveux, les boucles internes, les renflements « bulge », etc. dans toutes les familles d'ARNnc. Une base de données locale, RNAstem, a été créée pour faciliter la manipulation et la compilation des données sur les motifs de structure secondaire. Nous avons analysé toutes les boucles terminales et internes ainsi que les « bulges » et nous avons calculé un score d’abondance qui nous a permis d’étudier la fréquence de ces motifs. Tout en minimisant le biais de la surreprésentation de certaines classes d’ARN telles que l’ARN ribosomal, l’analyse des scores a permis de caractériser les motifs rares pour chacune des catégories d’ARN en plus de confirmer des motifs communs comme les boucles de type GNRA ou UNCG. Nous avons identifié des motifs abondants qui n’ont pas été étudiés auparavant tels que la « tetraloop » UUUU. 
En analysant le contenu de ces motifs en nucléotides, nous avons remarqué que ces régions simples brins contiennent beaucoup plus de nucléotides A et U. Enfin, nous avons exploré la possibilité d’utiliser ces scores pour la conception d’un filtre qui permettrait d’accélérer la recherche de nouveaux ARN non-codants. Nous avons développé un système de scores, RNAscore, qui permet d’évaluer un ARN en se basant sur son contenu en motifs et nous avons testé son applicabilité avec différents types de contrôles. / Noncoding RNAs (ncRNAs) are RNA transcripts that are not translated into proteins, yet they play important functional roles in the cell, including gene regulation, transcription and translation. Among the many categories of ncRNAs that have been discovered are the well-known ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), snoRNAs and microRNAs (miRNAs). The functions of ncRNAs are tightly linked to their structural features. Thus, understanding and predicting RNA structure, as well as developing methods to search for new ncRNAs, helps to gain insight into these molecules. Technological advances have made abundant sequence information available in databases such as Rfam, which provides alignments and structural information for many ncRNA families. In this research project, we retrieved from the Rfam database the sequences of all annotated secondary structures, such as hairpin loops, internal loops, bulges, etc., in all RNA families. A local database, RNAstem, was created to facilitate the use and manipulation of information about secondary structure motifs. We analyzed hairpin loops, bulges and internal loops using the compiled data about the frequency of occurrence of each loop or bulge and calculated a frequency score. The frequency score is intended as an indicator of the abundance of a specific secondary structure motif. 
While minimizing the bias caused by the high redundancy of some RNA classes such as ribosomal RNAs, the frequency score allowed us to identify the rare motifs in each category as well as the common ones. Our findings about the abundant motifs confirm what is already known from previous studies (e.g., abundant GNRA or UNCG tetraloops). We found very large gaps in abundance between the most abundant and the rarest RNA structural features. Moreover, we discovered that "A" and "U" dominate single-stranded RNA regions, whether they are bulges or loops. We further explored the possibility of using these data to improve current prediction tools for ncRNAs by applying a filter to new candidates. We developed a scoring system, RNAscore, that evaluates RNAs according to their motif content, and we tested the program with many different controls.
|
104 |
Neighborhood-Oriented feature selection and classification of Duke’s stages on colorectal Cancer using high density genomic data.
Peng, Liang January 1900
Master of Science / Department of Statistics / Haiyan Wang / The selection of relevant genes for classification of disease phenotypes from gene expression data has been extensively studied. Previously, most relevant gene selection was conducted on individual genes with limited sample sizes. Modern technology makes it possible to obtain microarray data with higher resolution of the chromosomes. Considering gene sets on an entire block of a chromosome rather than individual genes could help to reveal important connections between relevant genes and disease phenotypes. In this report, we consider feature selection and classification while taking into account the spatial location of probe sets in the classification of Duke’s stages B and C using DNA copy number data or gene expression data from colorectal cancers. A novel method for feature selection is presented in this report. A chromosome was first partitioned into blocks after the probe sets were aligned along their chromosome locations. Then a test of interaction between Duke’s stage and probe sets was conducted on each block of probe sets to select significant blocks. For each significant block, a new multiple comparison procedure was carried out to identify truly relevant probe sets while preserving the neighborhood location information of the probe sets. Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) classification using the final selected probe sets was conducted for all samples. The Leave-One-Out Cross Validation (LOOCV) estimate of accuracy is reported as an evaluation of the selected features. We applied the method to two large data sets, each containing more than 50,000 features. Excellent classification accuracy was achieved by the proposed procedure along with SVM or KNN for both data sets, even though classification of prognosis stages (Duke’s stages B and C) is much more difficult than classification of normal versus tumor types.
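The evaluation step named in this abstract, a leave-one-out cross-validation estimate of KNN accuracy, can be sketched as follows. This is a minimal illustration assuming a plain Euclidean KNN on already-selected features; the two-feature toy samples and the function names are hypothetical and do not come from the report.

```python
import math


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    # Ties are broken alphabetically so the result is deterministic.
    return max(sorted(votes), key=votes.get)


def loocv_accuracy(samples, k=3):
    """Hold out each sample once, train on the rest, report the hit rate."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if knn_predict(train, features, k) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical toy data: (selected probe-set values, stage label).
SAMPLES = [
    ((0.0, 0.1), "B"), ((0.2, 0.0), "B"), ((0.1, 0.3), "B"),
    ((5.0, 5.1), "C"), ((5.2, 4.9), "C"), ((4.8, 5.0), "C"),
]

if __name__ == "__main__":
    print(loocv_accuracy(SAMPLES, k=3))
```

Note that for an unbiased estimate the feature selection itself would normally be repeated inside each held-out fold; the sketch only covers the classification step.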
|
105 |
Genetic study of resistance to charcoal rot and Fusarium stalk rot diseases of sorghum
Adeyanju, Adedayo January 1900
Doctor of Philosophy / Department of Agronomy / Tesfaye Tesso / Fusarium stalk rot and charcoal rot, caused by Fusarium thapsinum and Macrophomina phaseolina respectively, are devastating global diseases of sorghum that lead to severe quality and yield losses each year. In this study, three sets of interrelated experiments were conducted that could lead to the development of resistance-based control options for these diseases.
The first experiment was aimed at identifying sources of resistance to infection by M. phaseolina and F. thapsinum in a diverse panel of 300 sorghum genotypes. The genotypes were evaluated in three environments following artificial inoculation. Out of a total of 300 genotypes evaluated, 95 genotypes were found to have resistance to M. phaseolina and 77 to F. thapsinum of which 53 genotypes were resistant to both pathogens.
In the second experiment, a set of 79,132 single nucleotide polymorphism (SNP) markers was used in an association study to identify genomic regions underlying stalk rot resistance using a multi-locus mixed model association mapping approach. We identified 14 loci associated with stalk rot and a set of candidate genes that appear to be involved in connected functions controlling plant defense response to stalk rot. The associated SNPs accounted for 19-30% of the phenotypic variation observed within and across environments. An analysis of associated allele frequencies within the major sorghum subpopulations revealed enrichment for resistant alleles in the durra and caudatum subpopulations compared with other subpopulations. The findings suggest a complicated molecular mechanism of resistance to stalk rots.
The objective of the third experiment was to determine the functional relationship between the stay-green trait, leaf dhurrin and soluble sugar levels, and resistance to stalk rot diseases. Fourteen genotypic groups derived from a Tx642 × Tx7000 RIL population carrying combinations of stay-green quantitative trait loci (stg QTL) were evaluated in three environments with four replications. The stg QTL had variable effects on stalk rot disease. Genotypes carrying stg1, stg3, stg1,3 and stg1,2,3,4 expressed good levels of resistance to M. phaseolina, but the combination of stg1 and stg3 was required to express the same level of resistance to F. thapsinum. Other stg QTL blocks such as stg2 and stg4 did not have any impact on resistance to stalk rot caused by either pathogen. There were no significant correlations between leaf dhurrin, soluble sugar concentration, and resistance to either pathogen.
|
106 |
Semi-supervised and transductive learning algorithms for predicting alternative splicing events in genes.
Tangirala, Karthik January 1900
Master of Science / Department of Computing and Information Sciences / Doina Caragea / As genomes are sequenced, a major challenge is their annotation -- the identification of genes and regulatory elements, their locations and their functions. For years, it was believed that one gene corresponds to one protein, but the discovery of alternative splicing provided a mechanism for generating different gene transcripts (isoforms) from the same genomic sequence. In recent years, it has become clear that a large fraction of genes undergo alternative splicing. Thus, understanding alternative splicing is a problem of great interest to biologists. Supervised machine learning approaches can be used to predict alternative splicing events at the genome level. However, supervised approaches require large amounts of labeled data to produce accurate classifiers. While large amounts of genomic data are produced by the new sequencing technologies, labeling these data can be costly and time consuming. Therefore, semi-supervised learning approaches that can make use of large amounts of unlabeled data, in addition to small amounts of labeled data, are highly desirable. In this work, we study the usefulness of a semi-supervised learning approach, co-training, for classifying exons as alternatively spliced or constitutive. The co-training algorithm makes use of two views of the data to iteratively learn two classifiers that can inform each other, at each step, with their best predictions on the unlabeled data. We consider three sets of features for constructing views for the problem of predicting alternatively spliced exons: lengths of the exon of interest and its flanking introns, exonic splicing enhancers (a.k.a. ESE motifs) and intronic regulatory sequences (a.k.a. IRS motifs). Naive Bayes and Support Vector Machine (SVM) algorithms are used as base classifiers in our study. 
Experimental results show that the use of the unlabeled data can result in better classifiers than those obtained from the small amount of labeled data alone. In addition to semi-supervised approaches, we also study the usefulness of graph-based transductive learning approaches for predicting alternatively spliced exons. Like the semi-supervised learning algorithms, transductive learning algorithms can make use of unlabeled data, together with labeled data, to produce labels for the unlabeled data. However, a classification model that could be used to classify new unlabeled data is not learned in this case. Experimental results show that graph-based transductive approaches can make effective use of the unlabeled data.
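The co-training loop summarized in this abstract can be sketched in a few lines. This is a rough illustration only: it swaps the Naive Bayes and SVM base classifiers named in the abstract for a much simpler nearest-centroid classifier, uses one numeric feature per view, and lets both views add their most confident prediction to a shared labeled pool. The data, labels, and function names are hypothetical.

```python
def centroid_classifier(labeled, view):
    """Build predict(x) -> (label, margin) from per-class means of one view.

    Assumes at least two classes are present in `labeled`.
    """
    sums = {}
    for features, label in labeled:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + features[view], count + 1)
    centroids = {label: total / count for label, (total, count) in sums.items()}

    def predict(features):
        dists = sorted((abs(features[view] - c), label)
                       for label, c in centroids.items())
        (d_best, best), (d_next, _) = dists[0], dists[1]
        return best, d_next - d_best  # larger margin = more confident

    return predict


def co_train(labeled, unlabeled, rounds=10):
    """Alternate between views; each labels its most confident example."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                return labeled
            predict = centroid_classifier(labeled, view)
            best = max(unlabeled, key=lambda x: predict(x)[1])
            labeled.append((best, predict(best)[0]))
            unlabeled.remove(best)
    return labeled


# Hypothetical exons described by two views: (length-like score, motif score).
LABELED = [((1.0, 1.0), "constitutive"), ((9.0, 9.0), "alternative")]
UNLABELED = [(2.0, 1.5), (8.0, 8.5), (3.0, 2.0), (7.0, 8.0)]

if __name__ == "__main__":
    for features, label in co_train(LABELED, UNLABELED):
        print(features, label)
```

Each view sees only its own feature, so an example that one view labels confidently becomes extra training data for the other view in the next step, which is the mechanism the abstract describes.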
|
107 |
Tribolium castaneum genes encoding proteins with the chitin-binding type II domain.
Jasrapuria, Sinu January 1900
Doctor of Philosophy / Department of Biochemistry / Subbarat Muthukrishnan / The extracellular matrices of the cuticle and peritrophic matrix of insects are composed mainly of chitin complexed with proteins, some of which contain chitin-binding domains. This study is focused on the identification and functional characterization of genes encoding proteins that possess one or more copies of the six-cysteine-containing ChtBD2 domain (Peritrophin A motif = CBM_14 = Pfam 01607) in the red flour beetle, Tribolium castaneum. A bioinformatics search of the T. castaneum genome yielded previously characterized chitin metabolic enzymes and several additional proteins. Using phylogenetic analyses, the exon-intron organization of the corresponding genes, the domain organization of the proteins, and the temporal and tissue specificity of expression patterns, these proteins were classified into three large families. The first family includes 11 proteins essentially made up of 1 to 14 repeats of the peritrophin A domain. Transcripts for these proteins are expressed only in the midgut and only during feeding stages of development. We therefore denote these proteins as “Peritrophic Matrix Proteins” or PMPs. The genes of the second and third families are expressed in cuticle-forming tissues throughout all stages of development but not in the midgut. These two families have been denoted as “Cuticular Proteins Analogous to Peritrophins 3” or CPAP3s and “Cuticular Proteins Analogous to Peritrophins 1” or CPAP1s based on the number of ChtBD2 domains that they contain. Unlike other cuticular proteins studied so far, the TcCPAP1-C protein is localized predominantly in the exocuticle and could contribute to the unique properties of this cuticular layer. RNA interference (RNAi), which down-regulates transcripts of a targeted gene, results in lethal and/or abnormal phenotypes for some, but not all, of these genes. 
Phenotypes are often unique and are manifested at different developmental stages, including embryonic, pupal and/or adult stages. The experiments presented in this dissertation reveal that the vast majority of the CPAP3 genes serve distinct and essential functions affecting survival, molting or normal cuticle development, whereas only a minority of the CPAP1 and PMP family genes are indispensable for survival under laboratory conditions. Some of the non-essential genes may have functional redundancy or may be needed only under special circumstances such as exposure to stress or pathogens.
|
108 |
Principes de l’évolution du réseau de l’homéostasie des protéines [Principles of the evolution of the protein homeostasis network]
Draceni, Yasmine 12 1900
No description available.
|
109 |
Mapping SH3 Domain Interactomes
Xin, Xiaofeng 21 April 2010
Src homology 3 (SH3) domains are one family of peptide recognition modules (PRMs). They bind peptides rich in proline or positively charged residues in target proteins and perform important assembly or regulatory functions in dynamic eukaryotic cellular processes, especially signal transduction and endocytosis. SH3 domains are conserved from yeast to human, and improper SH3 domain mediated protein-protein interaction (PPI) leads to defects in cellular function and may even result in disease states. Since commonly used large-scale PPI mapping strategies employed full-length proteins or random protein fragments as screening probes and did not identify the particular PPIs mediated by the SH3 domains, I employed a combined experimental and computational strategy to address this problem.
I used yeast two-hybrid (Y2H) as my major experimental tool, with individual SH3 domains as baits, to map SH3 domain mediated PPI networks, or “SH3 domain interactomes”. One of my important contributions has been the improvement of Y2H technology. First, I generated a pair of Y2H host strains that improved the efficiency of high-throughput Y2H screening and validated their usage. These strains were employed in my own research and were also adopted by other researchers in their large-scale PPI network mapping projects. Second, in collaboration with Nicolas Thierry-Mieg, I developed a novel smart-pooling method, Shifted Transversal Design (STD) pooling, and validated its application in large-scale Y2H. STD pooling proved superior to currently available methods for obtaining large-scale PPI maps with high coverage, high sensitivity and high specificity.
I mapped the SH3 domain interactomes for both the budding yeast Saccharomyces cerevisiae and the nematode worm Caenorhabditis elegans, which contain 27 and 84 SH3 domains, respectively. Comparison of these two SH3 interactomes revealed that the role of the SH3 domain is conserved at a functional but not a structural level, playing a major role in the assembly of an endocytosis network from yeast to worm. Moreover, the worm SH3 domains are additionally involved in metazoan-specific functions such as neurogenesis and vulval development. These results provide valuable insights into two important evolutionary processes from single-celled eukaryotes to animals: the functional expansion of the SH3 domains into new cellular modules, and the conservation and evolution of some cellular modules at the molecular level, particularly the endocytosis module.
|
110 |
Computational Prediction of Gene Function From High-throughput Data Sources
Mostafavi, Sara 31 August 2011
A large number and variety of genome-wide genomics and proteomics datasets are now available for model organisms. Each dataset on its own presents a distinct but noisy view of cellular state. However, collectively, these datasets embody a more comprehensive view of cell function. This motivates the prediction of function for uncharacterized genes by combining multiple datasets, in order to exploit the associations between such genes and genes of known function--all in a query-specific fashion.
Commonly, heterogeneous datasets are represented as networks in order to facilitate their combination. Here, I show that it is possible to accurately predict gene function in seconds by combining multiple large-scale networks. This facilitates on-demand function prediction, allowing users to take advantage of the continuing improvement and proliferation of genomics and proteomics datasets and to continuously make up-to-date predictions for large genomes such as the human genome.
Our algorithm, GeneMANIA, uses constrained linear regression to combine multiple association networks and uses label propagation to make predictions from the combined network. I introduce extensions that result in improved predictions when the number of labeled examples for training is limited, or when an ontological structure describing a hierarchical gene function categorization scheme is available. Further, motivated by our empirical observations on predicting node labels for general networks, I propose a new label propagation algorithm that exploits common properties of real-world networks to increase both the speed and accuracy of our predictions.
|