Statisticians routinely model interactions between variables on the basis of observed data. In many emerging fields, such as bioinformatics, they are confronted with datasets that have thousands of variables, a lot of noise, non-linear dependencies and only tens of samples. Detecting functional relationships in the presence of such uncertainty is a major challenge.

Our work focuses on variable selection and network inference from datasets with many variables and few samples (a high variable-to-sample ratio), such as microarray data. Variable selection is the machine learning task of selecting, among a set of input variables, those that lead to the best predictive model. Applied to gene expression data, variable selection methods can, for example, improve cancer diagnosis and prognosis by identifying a new molecular signature of the disease. Network inference consists of representing the dependencies between the variables of a dataset as a graph. Hence, when applied to microarray data, network inference can reverse-engineer the transcriptional regulatory network of a cell, with a view to discovering new drug targets to cure diseases.

In this work, two original tools are proposed: MASSIVE (Matrix of Average Sub-Subset Information for Variable Elimination), a new feature selection method, and MRNET (Minimum Redundancy NETwork), a new network inference algorithm. Both tools rely on the computation of mutual information, an information-theoretic measure of dependency. More precisely, MASSIVE and MRNET approximate the mutual information between a subset of variables and a target variable by combining mutual information terms between sub-subsets of variables and the target. These approximations make it possible to estimate a series of low-variate densities instead of one large multivariate density. Low-variate densities are well suited to datasets with a high variable-to-sample ratio, since they are computationally cheap and do not require a large number of samples to be estimated accurately. Numerous experimental results show the competitiveness of these new approaches. Finally, our thesis has led to freely available source code for MASSIVE and an open-source R and Bioconductor package for network inference.

Doctorate in Sciences, specialization in Computer Science / info:eu-repo/semantics/nonPublished
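To make the idea of replacing one large multivariate density with several low-variate ones more concrete, here is a minimal sketch in Python. It is not the MASSIVE or MRNET implementation: it assumes discrete variables, uses a simple plug-in estimate of mutual information, and swaps in a generic "relevance minus average redundancy" greedy criterion as a stand-in for the approximations described in the abstract. The function names `mutual_information` and `greedy_select` and the toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in (empirical) estimate of I(X; Y), in nats, for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over the observed pairs.
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def greedy_select(columns, target, k):
    """Greedy forward selection scored by relevance minus average redundancy.

    columns: dict mapping a variable name to its list of discrete values.
    Only bivariate mutual information terms are estimated, never a
    density over the whole selected subset (illustrative assumption).
    """
    relevance = {
        name: mutual_information(col, target) for name, col in columns.items()
    }
    selected = []
    while len(selected) < min(k, len(columns)):
        best_name, best_score = None, float("-inf")
        for name, col in columns.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                # Average dependency with the variables already chosen.
                redundancy = sum(
                    mutual_information(col, columns[s]) for s in selected
                ) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Toy data: the target is fully determined by "a" and "c"; "b" duplicates "a".
a = [0, 0, 0, 0, 1, 1, 1, 1]
c = [0, 0, 1, 1, 0, 0, 1, 1]
y = [2 * ai + ci for ai, ci in zip(a, c)]
print(greedy_select({"a": a, "b": list(a), "c": c}, y, 2))
```

In this toy setting the duplicate variable `b` is as relevant as `a` on its own, but its redundancy with `a` cancels that relevance, so the second pick is `c`. Every quantity involved is a bivariate estimate, which is the property that makes such schemes attractive when samples are scarce.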
Identifier | oai:union.ndltd.org:ulb.ac.be/oai:dipot.ulb.ac.be:2013/210396 |
Date | 16 December 2008 |
Creators | Meyer, Patrick E. |
Contributors | Bontempi, Gianluca, Cardinal, Jean, Rossi, Fabrice, Verleysen, Michel, Gardner, Timothy, Lenaerts, Tom |
Publisher | Universite Libre de Bruxelles, Université libre de Bruxelles, Faculté des Sciences – Informatique, Bruxelles |
Source Sets | Université libre de Bruxelles |
Language | French |
Detected Language | English |
Type | info:eu-repo/semantics/doctoralThesis, info:ulb-repo/semantics/doctoralThesis, info:ulb-repo/semantics/openurl/vlink-dissertation |
Format | 1 v., No full-text files |