Information-theoretic variable selection and network inference from microarray data
Meyer, Patrick E. 16 December 2008
Statisticians are used to modeling interactions between variables on the basis of observed data. In many emerging fields, such as bioinformatics, they are confronted with datasets having thousands of variables, a lot of noise, non-linear dependencies and only tens of samples. Detecting functional relationships when the data contain such uncertainty constitutes a major challenge.

Our work focuses on variable selection and network inference from datasets having many variables and few samples (high variable-to-sample ratio), such as microarray data. Variable selection is the area of machine learning whose objective is to select, among a set of input variables, those that lead to the best predictive model. Applying variable selection methods to gene expression data makes it possible, for example, to improve cancer diagnosis and prognosis by identifying a new molecular signature of the disease. Network inference consists of representing the dependencies between the variables of a dataset by a graph. Hence, when applied to microarray data, network inference can reverse-engineer the transcriptional regulatory network of a cell with a view to discovering new drug targets to cure diseases.

In this work, two original tools are proposed: MASSIVE (Matrix of Average Sub-Subset Information for Variable Elimination), a new method of feature selection, and MRNET (Minimum Redundancy NETwork), a new algorithm of network inference. Both tools rely on the computation of mutual information, an information-theoretic measure of dependency. More precisely, MASSIVE and MRNET approximate the mutual information between a subset of variables and a target variable by combining mutual information terms computed between sub-subsets of variables and the target. These approximations make it possible to estimate a series of low-variate densities instead of one large multivariate density. Low-variate densities are well suited to datasets with a high variable-to-sample ratio, since they are rather cheap in terms of computational cost and do not require a large number of samples in order to be estimated accurately. Numerous experimental results show the competitiveness of these new approaches. Finally, our thesis has led to freely available source code for MASSIVE and an open-source R and Bioconductor package for network inference.

Doctorat en sciences, Spécialisation Informatique / info:eu-repo/semantics/nonPublished
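
The idea of replacing one large multivariate density with several low-variate ones can be illustrated with a minimal Python sketch. It is not the MASSIVE or MRNET algorithm described in the thesis: it is a generic greedy "relevance minus redundancy" ranking that uses only pairwise mutual information, estimated with a simple plug-in (empirical frequency) estimator on discrete data. The function names `mutual_information` and `greedy_select`, the scoring rule, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi


def greedy_select(columns, target, k):
    """Rank variables using only bivariate terms (illustrative assumption).

    Score = MI(variable, target) - mean MI(variable, already selected).
    Only two-variable densities are ever estimated.
    """
    relevance = {name: mutual_information(col, target)
                 for name, col in columns.items()}
    selected = []
    remaining = set(columns)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(columns[name], columns[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): the target is the sum of "a" and "c";
# "a_copy" duplicates "a" and therefore adds no new information.
a = [0, 0, 1, 1, 0, 0, 1, 1]
c = [0, 1, 0, 1, 0, 1, 0, 1]
target = [ai + ci for ai, ci in zip(a, c)]
columns = {"a": a, "a_copy": list(a), "c": c}

chosen = greedy_select(columns, target, k=2)
print(chosen)
```

On this toy input the duplicated variable is penalised for its redundancy with `a`, so the ranking keeps `a` and `c`. With real expression data the values would first need to be discretized, and the plug-in estimator is known to be biased for small samples, so a different estimator may be preferable.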