Global ETD Search

Développement de méthodes de fouille de données basées sur les modèles de Markov cachés du second ordre pour l'identification d'hétérogénéités dans les génomes bactériens / Data Mining methods based on second-order Hidden Markov Models to identify heterogeneities into bacteria genomes

Second-order hidden Markov models (HMM2) are stochastic models that have proven effective for exploring genomic sequences. This thesis explores several types of HMM2 (M1M2, M2M2, M2M0), coupled with combinatorial methods, to segment bacterial genomes without a priori knowledge of their genetic content. The approach was applied to two bacterial models to validate its robustness: Streptomyces coelicolor and Streptococcus thermophilus. These species exhibit very distinct genomic traits (base composition, genome size) related to their ecological niches: soil for S. coelicolor and dairy products for S. thermophilus. In S. coelicolor, a first HMM2 architecture detected short, discrete DNA heterogeneities (5-16 nucleotides long), mostly located in intergenic regions. Applying the method to a biologically characterized gene set, the SigR regulon (involved in the oxidative stress response), demonstrated its efficiency in identifying bacterial promoters. S. coelicolor has a complex regulatory network (up to 12% of its genes may be involved in gene regulation), with more than 60 sigma factors involved in transcription initiation.
A classification method coupled with a search algorithm (i.e. R’MES) was developed to automatically extract composite box1-spacer-box2 DNA motifs, a structure corresponding to the typical -35/-10 boxes of bacterial promoters. Among the 814 DNA motifs described for the whole S. coelicolor genome, those of sigma factors (B, WhiG) could be retrieved from the raw data. A successful preliminary application to the genome of Bacillus subtilis indicates that the method can be generalized.

http://www.theses.fr/2010NAN10041/document

Bioinformatics

Data mining

Second-order hidden Markov model

Transcription factor binding site

Stochastic and combinatorial approach

Horizontal gene transfer

Streptomyces coelicolor

Streptococcus thermophilus

Identifier	oai:union.ndltd.org:theses.fr/2010NAN10041
Date	15 June 2010
Creators	Eng, Catherine
Contributors	Nancy 1, Université de Metz, Leblond, Pierre, Mari, Jean-François
Source Sets	Dépôt national des thèses électroniques françaises
Language	French
Detected Language	English
Type	Electronic Thesis or Dissertation, Text
