Global ETD Search

1	Graph Kernels and Applications in Bioinformatics Alvarez Vega, Marco 01 May 2011 In recent years, machine learning has emerged as an important discipline. However, despite the popularity of machine learning techniques, data in the form of discrete structures are not fully exploited. For example, when data appear as graphs, the common choice is to transform such structures into feature vectors. This procedure, though convenient, does not always effectively capture topological relationships inherent to the data; therefore, the power of the learning process may be insufficient. In this context, the use of kernel functions for graphs arises as an attractive way to deal with such structured objects. On the other hand, several entities in computational biology applications, such as gene products or proteins, may be naturally represented by graphs. Hence, the pressing need for algorithms that can deal with structured data poses the question of whether the use of kernels for graphs can outperform existing methods to solve specific computational biology problems. In this dissertation, we address the challenges involved in solving two specific problems in computational biology, in which the data are represented by graphs. First, we propose a novel approach for protein function prediction by modeling proteins as graphs. For each of the vertices in a protein graph, we propose the calculation of evolutionary profiles, which are derived from multiple sequence alignments of the amino acid residues within each vertex. We then use a shortest path graph kernel in conjunction with a support vector machine to predict protein function. We evaluate our approach under two instances of protein function prediction, namely, the discrimination of proteins as enzymes, and the recognition of DNA binding proteins. In both cases, our proposed approach achieves better prediction performance than existing methods.
Second, we propose two novel semantic similarity measures for proteins based on the gene ontology. The first measure directly works on the gene ontology by combining the pairwise semantic similarity scores between sets of annotating terms for a pair of input proteins. The second measure estimates protein semantic similarity using a shortest path graph kernel to take advantage of the rich semantic knowledge contained within ontologies. Our comparison with other methods shows that our proposed semantic similarity measures are highly competitive and the latter one outperforms state-of-the-art methods. Furthermore, our two methods are intrinsic to the gene ontology, in the sense that they do not rely on external sources to calculate similarities. graph kernels bioinformatics Computer Sciences
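The shortest path graph kernel mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes unlabeled, unweighted, undirected graphs given as adjacency dicts and a delta kernel on path lengths, and it leaves out the vertex-level evolutionary profiles. The function names and toy graphs are hypothetical.

```python
from collections import Counter, deque


def shortest_path_lengths(adj):
    """Return a Counter of shortest-path lengths over all unordered vertex pairs.

    `adj` maps each vertex to an iterable of neighbours (undirected, unweighted).
    """
    lengths = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                lengths[d] += 1
    # Each unordered pair was counted once from each endpoint.
    return Counter({d: c // 2 for d, c in lengths.items()})


def shortest_path_kernel(adj_a, adj_b):
    """Count pairs of shortest paths (one per graph) that have equal length."""
    hist_a = shortest_path_lengths(adj_a)
    hist_b = shortest_path_lengths(adj_b)
    return sum(count * hist_b[d] for d, count in hist_a.items())


# Hypothetical toy graphs: a 3-vertex path and a triangle.
path3 = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
k_path_triangle = shortest_path_kernel(path3, triangle)
k_path_path = shortest_path_kernel(path3, path3)
```

The resulting kernel values can be assembled into a Gram matrix and passed to any support vector machine implementation that accepts precomputed kernels.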
2	Novel Measures on Directed Graphs and Applications to Large-Scale Within-Network Classification Mantrach, Amin 25 October 2010 In recent years, networks have become a major data source in various fields ranging from social sciences to mathematical and physical sciences. Moreover, the size of available networks has grown substantially as well. This has brought with it a number of new challenges, like the need for precise and intuitive measures to characterize and analyze large-scale networks in a reasonable time. The first part of this thesis introduces a novel similarity measure between two nodes of a weighted directed graph: the sum-over-paths covariance. 
It has a clear and intuitive interpretation: two nodes are considered highly correlated if they often co-occur on the same (preferably short) paths. This measure depends on a probability distribution over the countable (usually infinite) set of paths through the graph, which is obtained by minimizing the total expected cost between all pairs of nodes while fixing the total relative entropy spread in the graph. The entropy parameter allows the probability distribution to be biased over a wide spectrum, from natural random walks (where all paths are equiprobable) to walks biased towards shortest paths. This measure is then applied to semi-supervised classification problems on medium-size networks and compared to state-of-the-art techniques. The second part introduces three novel algorithms for within-network classification in large-scale networks, i.e., classification of nodes in partially labeled graphs. The algorithms have a computing time linear in the number of edges, classes and steps and can hence be applied to large-scale networks. They obtained competitive results in comparison to state-of-the-art techniques on the large-scale U.S. patents citation network and on eight other data sets. Furthermore, during the thesis, we collected a novel benchmark data set: the U.S. patents citation network. This data set is now available to the community for benchmarking purposes. The final part of the thesis concerns the combination of a citation graph with information on its nodes. We show that citation-based data provide better results for classification than content-based data. We also show empirically that combining both sources of information (content-based and citation-based) should be considered when facing a text categorization problem. For instance, when classifying journal papers, extracting an external citation graph may considerably boost performance. 
However, in another context, when we have to directly classify the nodes of the citation network, features on the nodes will not necessarily improve the results. The theory, algorithms and applications presented in this thesis provide interesting perspectives in various fields. semi-supervised classification large scale graphs betweenness centrality graph kernels
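The role of the entropy parameter in the abstract above can be illustrated on a toy example. This sketch assumes the standard result that minimizing expected cost under a fixed relative entropy to a reference random walk yields a Gibbs-Boltzmann distribution over paths, P(path) proportional to P_ref(path) * exp(-theta * cost(path)). The parameter name `theta`, the uniform out-edge reference walk, the toy graph, and the function names are all hypothetical, and only the finite path set of a small acyclic graph is enumerated rather than the infinite set the thesis handles.

```python
import math


def enumerate_paths(graph, source, target, prefix=None):
    """Yield every path from `source` to `target` in a small acyclic graph."""
    prefix = (prefix or []) + [source]
    if source == target:
        yield prefix
        return
    for nxt in graph.get(source, {}):
        yield from enumerate_paths(graph, nxt, target, prefix)


def path_distribution(graph, source, target, theta):
    """Return {path: probability} under P(path) ~ P_ref(path) * exp(-theta * cost).

    `graph[u][v]` is the cost of edge u -> v. The reference walk (an assumption
    of this sketch) picks uniformly among the out-edges of each node.
    """
    weights = {}
    for path in enumerate_paths(graph, source, target):
        cost = 0.0
        ref_prob = 1.0
        for u, v in zip(path, path[1:]):
            cost += graph[u][v]
            ref_prob *= 1.0 / len(graph[u])
        weights[tuple(path)] = ref_prob * math.exp(-theta * cost)
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


# Hypothetical toy graph: a short route s-a-t (cost 2) and a long route s-b-c-t (cost 3).
toy = {
    "s": {"a": 1.0, "b": 1.0},
    "a": {"t": 1.0},
    "b": {"c": 1.0},
    "c": {"t": 1.0},
    "t": {},
}
unbiased = path_distribution(toy, "s", "t", theta=0.0)
biased = path_distribution(toy, "s", "t", theta=5.0)
```

With `theta` at zero the distribution is that of the reference walk; as `theta` grows, probability mass concentrates on the lowest-cost path.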
3	Bayesian Optimization for Neural Architecture Search using Graph Kernels Krishnaswami Sreedhar, Bharathwaj January 2020 Neural architecture search is a popular method for automating architecture design. Bayesian optimization is a widely used approach for hyperparameter optimization and can estimate a function from a limited number of samples. However, Bayesian optimization methods are not preferred for architecture search, as they expect vector inputs while graphs are high-dimensional data. This thesis presents a Bayesian approach with Gaussian priors that uses graph kernels specifically targeted to work in the higher-dimensional graph space. We implemented three different graph kernels and show that, on the NAS-Bench-101 dataset, an untrained graph convolutional network kernel significantly outperforms previous methods in terms of the best network found and the number of samples required to find it. We follow the AutoML guidelines to make this work reproducible.
Neural architecture search Bayesian optimization Graph kernels Graph convolutional networks Computer and Information Sciences
4	Sparsity regularization and graph-based representation in medical imaging / La régularisation parcimonieuse et la représentation à base de graphiques dans l'imagerie médicale Gkirtzou, Aikaterini 17 December 2013 Medical images are used to depict anatomy or function. Their high dimensionality and nonlinear nature make their analysis a challenging problem. In this thesis, we address medical image analysis from the viewpoint of statistical learning theory. First, we examine regularization methods for analyzing MRI data. In this direction, we introduce a novel regularization method, the k-support regularized Support Vector Machine. This algorithm extends the ℓ1-regularized SVM to a mixed norm of both ℓ1 and ℓ2 norms. We evaluate our algorithm in a neuromuscular disease classification task. Second, we approach the problem of graph representation and comparison for analyzing medical images. Graphs are a technique to represent data with inherent structure. Despite the significant progress in graph kernels, existing graph kernels focus on either unlabeled or discretely labeled graphs, while efficient and expressive representation and comparison of graphs with continuous high-dimensional vector labels remains an open research problem. We introduce a novel method, the pyramid quantized Weisfeiler-Lehman graph representation, to tackle the graph comparison problem for continuous vector labeled graphs. Our algorithm considers statistics of subtree patterns based on the Weisfeiler-Lehman algorithm and uses a pyramid quantization strategy to determine a logarithmic number of discrete labelings. We evaluate our algorithm on two different tasks with real datasets. Overall, as graphs are fundamental mathematical objects and regularization methods are used to control ill-posed problems, both proposed algorithms are potentially applicable to a wide range of domains. Weisfeiler-Lehman algorithm Graph kernels Regularization
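The Weisfeiler-Lehman relabeling behind the subtree statistics mentioned in the abstract above can be sketched as below. This is a minimal illustration of the standard WL subtree kernel for discretely labeled graphs, not the thesis's pyramid quantized variant: the quantization of continuous vector labels is omitted, and the function names, the graph encoding, and the iteration count `h` are assumptions.

```python
from collections import Counter


def wl_features(adj, labels, h):
    """Return a Counter of all labels seen over `h` rounds of WL relabeling.

    `adj` maps vertex -> list of neighbours; `labels` maps vertex -> hashable label.
    Refined labels are kept as nested tuples so that they agree across graphs
    without a shared compression table.
    """
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(h):
        # New label = own label plus the sorted multiset of neighbour labels.
        current = {
            v: (current[v], tuple(sorted((current[u] for u in adj[v]), key=repr)))
            for v in adj
        }
        features.update(current.values())
    return features


def wl_subtree_kernel(adj_a, labels_a, adj_b, labels_b, h=1):
    """Dot product of the WL label histograms of two graphs."""
    feats_a = wl_features(adj_a, labels_a, h)
    feats_b = wl_features(adj_b, labels_b, h)
    return sum(count * feats_b[label] for label, count in feats_a.items())


# Hypothetical toy graphs with a single discrete label "x" on every vertex.
path3 = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
uniform = {0: "x", 1: "x", 2: "x"}
k_path_triangle = wl_subtree_kernel(path3, uniform, triangle, uniform, h=1)
k_path_path = wl_subtree_kernel(path3, uniform, path3, uniform, h=1)
```

Nested tuples are used here only for readability; practical implementations usually compress each refined label to a short integer identifier.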
5	Hypernode graphs for learning from binary relations between sets of objects / Un modèle d'hypergraphes pour apprendre des relations binaires entre des ensembles d'objets Ricatte, Thomas 23 January 2015 This study is about hypergraphs. Hypergraphs Graph Laplacians Graph kernels Spectral learning Semi-supervised learning Skill rating algorithms
6	Sparsity regularization and graph-based representation in medical imaging Gkirtzou, Aikaterini 17 December 2013 Medical images are used to depict anatomy or function. Their high dimensionality and nonlinear nature make their analysis a challenging problem. In this thesis, we address medical image analysis from the viewpoint of statistical learning theory. First, we examine regularization methods for analyzing MRI data. In this direction, we introduce a novel regularization method, the k-support regularized Support Vector Machine. This algorithm extends the ℓ1-regularized SVM to a mixed norm of both ℓ1 and ℓ2 norms. We evaluate our algorithm in a neuromuscular disease classification task. Second, we approach the problem of graph representation and comparison for analyzing medical images. Graphs are a technique to represent data with inherent structure. Despite the significant progress in graph kernels, existing graph kernels focus on either unlabeled or discretely labeled graphs, while efficient and expressive representation and comparison of graphs with continuous high-dimensional vector labels remains an open research problem. We introduce a novel method, the pyramid quantized Weisfeiler-Lehman graph representation, to tackle the graph comparison problem for continuous vector labeled graphs. Our algorithm considers statistics of subtree patterns based on the Weisfeiler-Lehman algorithm and uses a pyramid quantization strategy to determine a logarithmic number of discrete labelings. We evaluate our algorithm on two different tasks with real datasets. Overall, as graphs are fundamental mathematical objects and regularization methods are used to control ill-posed problems, both proposed algorithms are potentially applicable to a wide range of domains. Weisfeiler-Lehman algorithm Graph kernels Regularization
7	Novel measures on directed graphs and applications to large-scale within-network classification Mantrach, Amin 25 October 2010 In recent years, networks have become a major data source in various fields ranging from social sciences to mathematical and physical sciences. Moreover, the size of available networks has grown substantially as well. This has brought with it a number of new challenges, like the need for precise and intuitive measures to characterize and analyze large-scale networks in a reasonable time. The first part of this thesis introduces a novel similarity measure between two nodes of a weighted directed graph: the sum-over-paths covariance. 
It has a clear and intuitive interpretation: two nodes are considered highly correlated if they often co-occur on the same (preferably short) paths. This measure depends on a probability distribution over the countable (usually infinite) set of paths through the graph, which is obtained by minimizing the total expected cost between all pairs of nodes while fixing the total relative entropy spread in the graph. The entropy parameter allows the probability distribution to be biased over a wide spectrum, from natural random walks (where all paths are equiprobable) to walks biased towards shortest paths. This measure is then applied to semi-supervised classification problems on medium-size networks and compared to state-of-the-art techniques. The second part introduces three novel algorithms for within-network classification in large-scale networks, i.e., classification of nodes in partially labeled graphs. The algorithms have a computing time linear in the number of edges, classes and steps and can hence be applied to large-scale networks. They obtained competitive results in comparison to state-of-the-art techniques on the large-scale U.S. patents citation network and on eight other data sets. Furthermore, during the thesis, we collected a novel benchmark data set: the U.S. patents citation network. This data set is now available to the community for benchmarking purposes. The final part of the thesis concerns the combination of a citation graph with information on its nodes. We show that citation-based data provide better results for classification than content-based data. We also show empirically that combining both sources of information (content-based and citation-based) should be considered when facing a text categorization problem. For instance, when classifying journal papers, extracting an external citation graph may considerably boost performance. 
However, in another context, when we have to directly classify the nodes of the citation network, features on the nodes will not necessarily improve the results. The theory, algorithms and applications presented in this thesis provide interesting perspectives in various fields. / Doctorate in Sciences / info:eu-repo/semantics/nonPublished General computer science Exact and natural sciences Network computers Kernel functions Graph theory -- Data processing Markov processes betweenness centrality large scale graphs semi-supervised classification graph kernels
