Sequential data are encountered in many contexts of everyday life and in numerous scientific applications. They can for instance be SMS typeset on mobile phones, web pages reached while crossing hyperlinks, system logs or DNA samples, to name a few. Generating such data defines a sequential process.
This thesis is concerned with the modeling of sequential processes from observed data. Sequential processes are here modeled using probabilistic models, namely discrete time Markov chains (MC), Hidden Markov Models (HMMs) and Partially Observable Markov Models (POMMs). Such models can answer questions such as (i) Which event will occur a couple of steps later? (ii) How many times will a particular event occur? and (iii) When does an event occur for the first time given the current situation?
The last question is of particular interest in this thesis and is mathematically formalized under the notion of First Passage Times (FPT) dynamics of a process. The FPT dynamics is used here to solve the three following problems related to machine learning and data mining: (i) HMM/POMM induction, (ii) supervised sequence classification and (iii) relevant subgraph mining.
Firstly, we propose a novel algorithm, called POMMStruct, for learning the structure and the parameters of POMMs to best fit the empirical FPT dynamics observed in the samples. Experimental results illustrate the benefit of POMMStruct in the modeling of sequential processes with a complex temporal dynamics while compared to classical induction approaches. Our second contribution is concerned with the classification of sequences. We propose to model the FPT in sequences with discrete phase-type (PH) distributions using a novel algorithm called PHit. These distributions are used to devise a new string kernel and a probabilistic classifier. Experimental results on biological data shows that our methods provides state-of-the-art classification results. Finally, we address the problem of mining subgraphs, which are relevant to model the relationships between selected nodes of interest, in large graphs such as biological networks. A new relevance criterion based on particular random walks called K-walks is proposed as well as efficient algorithms to compute this criterion. Experiments on the KEGG metabolic network and on randomly generated graphs are presented.
Identifer | oai:union.ndltd.org:BICfB/oai:ucl.ac.be:ETDUCL:BelnUcetd-10112007-214112 |
Date | 12 October 2007 |
Creators | Callut, Jérôme |
Publisher | Universite catholique de Louvain |
Source Sets | Bibliothèque interuniversitaire de la Communauté française de Belgique |
Language | English |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-10112007-214112/ |
Rights | unrestricted, J'accepte que le texte de la thèse (ci-après l'oeuvre), sous réserve des parties couvertes par la confidentialité, soit publié dans le recueil électronique des thèses UCL. A cette fin, je donne licence à l'UCL : - le droit de fixer et de reproduire l'oeuvre sur support électronique : logiciel ETD/db - le droit de communiquer l'oeuvre au public Cette licence, gratuite et non exclusive, est valable pour toute la durée de la propriété littéraire et artistique, y compris ses éventuelles prolongations, et pour le monde entier. Je conserve tous les autres droits pour la reproduction et la communication de la thèse, ainsi que le droit de l'utiliser dans de futurs travaux. Je certifie avoir obtenu, conformément à la législation sur le droit d'auteur et aux exigences du droit à l'image, toutes les autorisations nécessaires à la reproduction dans ma thèse d'images, de textes, et/ou de toute oeuvre protégés par le droit d'auteur, et avoir obtenu les autorisations nécessaires à leur communication à des tiers. Au cas où un tiers est titulaire d'un droit de propriété intellectuelle sur tout ou partie de ma thèse, je certifie avoir obtenu son autorisation écrite pour l'exercice des droits mentionnés ci-dessus. |
Page generated in 0.0026 seconds