Global ETD Search

31	Agrégation de classements avec égalités : algorithmes, guides à l'utilisateur et applications aux données biologiques / Rank aggregation with ties : algorithms, user guidance et applications to biologicals data Brancotte, Bryan 25 September 2015 (has links) L'agrégation de classements consiste à établir un consensus entre un ensemble de classements (éléments ordonnés). Bien que ce problème ait de très nombreuses applications (consensus entre les votes d'utilisateurs, consensus entre des résultats ordonnés différemment par divers moteurs de recherche...), calculer un consensus exact est rarement faisable dans les cas d'applications réels (problème NP-difficile). De nombreux algorithmes d'approximation et heuristiques ont donc été conçus. Néanmoins, leurs performances (en temps et en qualité de résultat produit) sont très différentes et dépendent des jeux de données à agréger. Plusieurs études ont cherché à comparer ces algorithmes mais celles-ci n’ont généralement pas considéré le cas (pourtant courant dans les jeux de données réels) des égalités entre éléments dans les classements (éléments classés au même rang). Choisir un algorithme de consensus adéquat vis-à-vis d'un jeu de données est donc un problème particulièrement important à étudier (grand nombre d’applications) et c’est un problème ouvert au sens où aucune des études existantes ne permet d’y répondre. Plus formellement, un consensus de classements est un classement qui minimise le somme des distances entre ce consensus et chacun des classements en entrés. Nous avons considérés (comme une grande partie de l’état-de-art) la distance de Kendall-Tau généralisée, ainsi que des variantes, dans nos études. Plus précisément, cette thèse comporte trois contributions. Premièrement, nous proposons de nouveaux résultats de complexité associés aux cas que l'on rencontre dans les données réelles où les classements peuvent être incomplets et où plusieurs éléments peuvent être classés à égalité. 
Nous isolons les différents « paramètres » qui peuvent expliquer les variations au niveau des résultats produits par les algorithmes d’agrégation (par exemple, utilisation de la distance de Kendall-Tau généralisée ou de variantes, d’un pré-traitement des jeux de données par unification ou projection). Nous proposons un guide pour caractériser le contexte et le besoin d’un utilisateur afin de le guider dans le choix à la fois d’un pré-traitement de ses données mais aussi de la distance à choisir pour calculer le consensus. Nous proposons finalement une adaptation des algorithmes existants à ce nouveau contexte. Deuxièmement, nous évaluons ces algorithmes sur un ensemble important et varié de jeux de données à la fois réels et synthétiques reproduisant des caractéristiques réelles telles que similarité entre classements, la présence d'égalités, et différents pré-traitements. Cette large évaluation passe par la proposition d’une nouvelle méthode pour générer des données synthétiques avec similarités basée sur une modélisation en chaîne Markovienne. Cette évaluation a permis d'isoler les caractéristiques des jeux de données ayant un impact sur les performances des algorithmes d'agrégation et de concevoir un guide pour caractériser le besoin d'un utilisateur et le conseiller dans le choix de l'algorithme à privilégier. Une plateforme web permettant de reproduire et étendre ces analyses effectuée est disponible (rank-aggregation-with-ties.lri.fr). Enfin, nous démontrons l'intérêt d'utiliser l'approche d'agrégation de classements dans deux cas d'utilisation. Nous proposons un outil reformulant à-la-volé des requêtes textuelles d'utilisateur grâce à des terminologies biomédicales, pour ensuite interroger de bases de données biologiques, et finalement produire un consensus des résultats obtenus pour chaque reformulation (conqur-bio.lri.fr). Nous comparons l'outil à la plateforme de références et montrons une amélioration nette des résultats en qualité. 
Nous calculons aussi des consensus entre liste de workflows établie par des experts dans le contexte de la similarité entre workflows scientifiques. Nous observons que les consensus calculés sont très en accord avec les utilisateurs dans une large proportion de cas. / The rank aggregation problem is to build a consensus among a set of rankings (ordered elements). Although this problem has numerous applications (consensus among user votes, consensus among results ordered differently by different search engines, ...), computing an exact consensus is rarely feasible in real applications (the problem is NP-hard). Many approximation algorithms and heuristics have therefore been designed. However, their performance (in running time and in quality of the result produced) varies widely and depends on the datasets to be aggregated. Several studies have compared these algorithms, but they have generally not considered the case (common in real datasets) in which elements can be tied in rankings (elements placed at the same rank). Choosing a suitable consensus algorithm for a given dataset is therefore a particularly important issue (given the many applications), and it is an open problem in the sense that none of the existing studies addresses it. More formally, a consensus ranking is a ranking that minimizes the sum of the distances between this consensus and each of the input rankings. Like much of the state of the art, we consider the generalized Kendall-Tau distance, and variants of it, in our studies. Specifically, this thesis makes three contributions. First, we propose new complexity results for cases encountered in real data, where rankings may be incomplete and where several items may be ranked equally (ties). We isolate the different "parameters" that can explain variations in the results produced by the aggregation algorithms (for example, the use of the generalized Kendall-Tau distance or of variants, and pre-processing of the datasets by unification or projection). 
We propose a guide that characterizes a user's context and needs in order to guide the choice of both a pre-processing of the datasets and the distance used to compute the consensus. Finally, we adapt existing algorithms to this new context. Second, we evaluate these algorithms on a large and varied set of datasets, both real and synthetic, reproducing real features such as similarity between rankings, the presence of ties, and different pre-processings. This large evaluation relies on a new method for generating synthetic data with similarities, based on Markov chain modeling. The evaluation isolated the dataset features that affect the performance of the aggregation algorithms, and led to a guide that characterizes a user's needs and advises on which algorithm to use. A web platform for replicating and extending these analyses is available (rank-aggregation-with-ties.lri.fr). Finally, we demonstrate the value of the rank aggregation approach in two use cases. We provide a tool that reformulates users' textual queries on the fly using biomedical terminologies, then queries biological databases, and finally produces a consensus of the results obtained for each reformulation (conqur-bio.lri.fr). We compare the tool to the reference platform and show a clear improvement in the quality of the results. We also compute consensuses between lists of workflows established by experts in the context of similarity between scientific workflows. We observe that the computed consensuses agree with the experts in a very large majority of cases. Agrégation de classements Agrégation de préférences Top-k Topk Classement de Kemeny optimal Solution exact Guidance Benchmark Résultat de complexité Interrogation de sources biomédicale NP-difficile Rank aggregation Preference aggregation Top-k Topk Optimal Kemeny ranking Exact solution Guidance Benchmark Complexity results Querying biomedical sources NP-Hard
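The abstract above defines a consensus as a ranking that minimizes the sum of distances to the input rankings. The following is a minimal sketch of that objective, not the thesis's algorithms. It assumes rankings are given as lists of buckets (sets of tied elements) over the same elements, and it assumes a pair tied in one ranking but not in the other costs `tie_penalty` (a hypothetical parameter, set to 1 here). The function names are illustrative, and `best_input_ranking` is only a simple baseline that returns whichever input ranking scores best.

```python
from itertools import combinations


def positions(ranking):
    """Map each element to the index of its bucket (tied elements share an index)."""
    return {elem: idx for idx, bucket in enumerate(ranking) for elem in bucket}


def generalized_kendall_tau(r1, r2, tie_penalty=1.0):
    """Count pairwise disagreements between two rankings with ties.

    A pair ordered in opposite directions costs 1; a pair tied in exactly
    one of the two rankings costs `tie_penalty` (an assumption of this sketch).
    """
    p1, p2 = positions(r1), positions(r2)
    if p1.keys() != p2.keys():
        raise ValueError("rankings must be over the same elements")
    dist = 0.0
    for a, b in combinations(sorted(p1), 2):
        d1 = p1[a] - p1[b]
        d2 = p2[a] - p2[b]
        if d1 * d2 < 0:                  # strictly ordered, opposite directions
            dist += 1.0
        elif (d1 == 0) != (d2 == 0):     # tied in exactly one ranking
            dist += tie_penalty
    return dist


def consensus_score(candidate, rankings, tie_penalty=1.0):
    """Sum of distances from a candidate consensus to every input ranking."""
    return sum(generalized_kendall_tau(candidate, r, tie_penalty) for r in rankings)


def best_input_ranking(rankings, tie_penalty=1.0):
    """Baseline: return the input ranking with the lowest consensus score."""
    return min(rankings, key=lambda r: consensus_score(r, rankings, tie_penalty))


rankings = [
    [{"A"}, {"B", "C"}],       # B and C are tied
    [{"A"}, {"B"}, {"C"}],
    [{"B"}, {"A"}, {"C"}],
]
print(best_input_ranking(rankings))
```

An exact consensus would require searching over all rankings with ties, which is what makes the problem hard in practice; the baseline above only inspects the inputs themselves.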
32	Scalable Parallel Machine Learning on High Performance Computing Systems–Clustering and Reinforcement Learning Weijian Zheng (14226626) 08 December 2022 (has links) High-performance computing (HPC) and machine learning (ML) have been widely adopted by both academia and industries to address enormous data problems at extreme scales. While research has reported on the interactions of HPC and ML, achieving high performance and scalability for parallel and distributed ML algorithms is still a challenging task. This dissertation first summarizes the major challenges for applying HPC to ML applications: 1) poor performance and scalability, 2) loss of the convergence rate, 3) lower quality of the trained model, and 4) a lack of performance optimization techniques designed for specific applications. Researchers can address the four challenges in new ML applications. This dissertation shows how to solve them for two specific applications: 1) a clustering algorithm and 2) graph optimization algorithms that use reinforcement learning (RL). As to the clustering algorithm, we first propose an algorithm called the simulated-annealing clustering algorithm. By combining a blocked data layout and asynchronous local optimization within each thread, the simulated-annealing enhanced clustering algorithm has a convergence rate that is comparable to the K-means algorithm but with much higher performance. Experiments with synthetic and real-world datasets show that the simulated-annealing enhanced clustering algorithm is significantly faster than the MPI K-means library using up to 1024 cores. However, the optimization costs (Sum of Square Error (SSE)) of the simulated-annealing enhanced clustering algorithm became higher than the original costs. To tackle this problem, we devise a new algorithm called the full-step feel-the-way clustering algorithm. In the full-step feel-the-way algorithm, there are L local steps within each block of data points. 
We use the first local step’s results to compute accurate global optimization costs. Our results show that the full-step algorithm can significantly reduce the global number of iterations needed to converge while obtaining low SSE costs. However, the time spent on the local steps is greater than the benefits of the saved iterations. To improve this performance, we next optimize the local step time by incorporating a sampling-based method called reassignment-history-aware sampling. Extensive experiments with various synthetic and real world datasets (e.g., MNIST, CIFAR-10, ENRON, and PLACES-2) show that our parallel algorithms can outperform the fastest open-source MPI K-means implementation by up to 110% on 4,096 CPU cores with comparable SSE costs. Our evaluations of the sampling-based feel-the-way algorithm establish the effectiveness of the local optimization strategy, the blocked data layout, and the sampling methods for addressing the challenges of applying HPC to ML applications. To explore more parallel strategies and optimization techniques, we focus on a more complex application: graph optimization problems using reinforcement learning (RL). RL has proved successful for automatically learning good heuristics to solve graph optimization problems. However, the existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has compromised RL’s ability to solve large scale graph optimization problems due to the lack of parallelization and high scalability. To address the challenges of parallelization and scalability, we develop OpenGraphGym-MG, a high performance distributed-GPU RL framework for solving graph optimization problems. OpenGraphGym-MG focuses on a class of computationally demanding RL problems in which both the RL environment and the policy model are highly computation intensive. 
In this work, we distribute large-scale graphs across distributed GPUs and use spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of spatial and data parallelism and highlight their differences. To support graph neural network (GNN) layers that take data samples partitioned across distributed GPUs as input, we design new parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design new parallel graph environments to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high performance OpenGraphGym-MG training and inference algorithms in parallel. To summarize, after proposing the major challenges for applying HPC to ML applications, this thesis explores several parallel strategies and performance optimization techniques using two ML applications. Specifically, we propose a local optimization strategy, a blocked data layout, and sampling methods for accelerating the clustering algorithm, and we create a spatial parallelism strategy, a parallel graph environment, agent, and policy model, and an optimized replay buffer, and multi-node selection strategy for solving large optimization problems over graphs. Our evaluations prove the effectiveness of these strategies and demonstrate that our accelerations can significantly outperform the state-of-the-art ML libraries and frameworks without loss of quality in trained models. Graph, social and multimedia data Distributed systems and algorithms High performance computing Reinforcement learning High Performance Computing (HPC) Clustering Algorithm Reinforcement Learning combinatorial optimization problems graph problems Travelling salesperson problem Minimum Vertex Cover Problem Distributed processing of data NP-Hard optimization problems model parallelism data parallelism
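The abstract above measures clustering quality by the Sum of Square Error (SSE) and compares against K-means. As a point of reference, the sketch below shows the SSE cost and one sequential K-means (Lloyd) iteration in plain Python. It is a textbook baseline under simple assumptions (points as tuples, an empty cluster keeps its old centroid), not the simulated-annealing or feel-the-way algorithms of the dissertation, and all function names are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(squared_distance(p, centroids[nearest(p, centroids)]) for p in points)


def lloyd_step(points, centroids):
    """One K-means iteration: assign points, then recompute each centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        clusters[nearest(p, centroids)].append(p)
    updated = []
    for old, members in zip(centroids, clusters):
        if members:
            dim = len(old)
            updated.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(dim)))
        else:
            updated.append(old)  # assumption: an empty cluster keeps its centroid
    return updated


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (12.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
print(sse(points, centroids))                      # cost before the update
print(sse(points, lloyd_step(points, centroids)))  # cost after one iteration
```

Each iteration of this baseline touches every point, which is the per-iteration cost that parallel and sampling-based variants aim to reduce.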
33	Algorithms for the Maximum Independent Set Problem Lê, Ngoc C. 13 July 2015 (has links) (PDF) This thesis focuses mainly on the Maximum Independent Set (MIS) problem. Some related graph-theoretical combinatorial problems are also considered. As these problems are generally NP-hard, we study their complexity in hereditary graph classes, i.e. graph classes defined by a set F of forbidden induced subgraphs. We review the literature on the problem, including complexity results, applications, and techniques for tackling it. By considering some general approaches, we exhibit several cases where the problem admits a polynomial-time solution. More specifically, we present polynomial-time algorithms for the MIS problem in: + some subclasses of $S_{2;j;k}$-free graphs (thus generalizing the classical result for $S_{1;2;k}$-free graphs); + some subclasses of $tree_{k}$-free graphs (thus generalizing the classical results for subclasses of $P_{5}$-free graphs); + some subclasses of $P_{7}$-free graphs and $S_{2;2;2}$-free graphs; and + various subclasses of graphs of bounded maximum degree, for example subcubic graphs. Our algorithms are based on various approaches. In particular, we characterize augmenting graphs in a subclass of $S_{2;k;k}$-free graphs and a subclass of $S_{2;2;5}$-free graphs. These characterizations are partly based on extensions of the concept of redundant set [125]. We also propose methods for finding augmenting chains, an extension of the method in [99], and for finding augmenting trees, an extension of the methods in [125]. We apply the augmenting vertex technique, originally used for $P_{5}$-free graphs or banner-free graphs, to some more general graph classes. We consider a general graph-theoretical combinatorial problem, the so-called Maximum Π-Set problem. Two special cases of this problem, the so-called Maximum F-(Strongly) Independent Subgraph and Maximum F-Induced Subgraph problems, where F is a set of connected graphs, are considered. 
The complexity of the Maximum F-(Strongly) Independent Subgraph problem is reviewed and the NP-hardness of the Maximum F-Induced Subgraph problem is proved. We also extend the augmenting approach so that it applies to the general Maximum Π-Set problem. We review classical graph transformations and give two unified views based on pseudo-Boolean functions and α-redundant vertices. We also make extensive use of α-redundant vertices, originally used mainly for $P_{5}$-free graphs, to give polynomial solutions for some subclasses of $S_{2;2;2}$-free graphs and $tree_{k}$-free graphs. We consider some classical sequential greedy heuristic methods. We also combine classical algorithms with α-redundant vertices to obtain new strategies for choosing the next vertex in greedy methods. Some aspects of the algorithms, for example forbidden induced subgraph sets and worst-case results, are also considered. Finally, we restrict our attention to graphs of bounded maximum degree and subcubic graphs. Then, using techniques such as α-redundant vertices, clique separators, and distance-based arguments, we generalize these results to some subclasses of $S_{i;j;k}$-free subcubic graphs. maximal unabhängige Menge maximal stabile Menge Unabhängigkeitszahl Stabilitätszahl vergrößernde Graphen vergrößernde Knoten Modulzerlegung Cliquenseparator Graphentransformationen pseudo-Boolesche Funktion α-redundante Knoten MIN MAX VO Knotenanordnung Heuristik polynomiale Lösung NP-schwer Maximum Independent Set Stable Set Independence Number Stability Number Augmenting Graph Augmenting Vertex Modular Decomposition Clique Separator Graph Transformations Pseudo-Boolean Function α-redundant Vertex MIN MAX VO Vertex Ordering Heuristic Subcubic Graphs Polynomial Solution NP-hard ddc:510 Stabile-Mengen-Problem
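The abstract above mentions classical sequential greedy heuristics for the Maximum Independent Set problem. One standard rule of that kind repeatedly picks a vertex of minimum degree in the remaining graph, adds it to the set, and deletes it together with its neighbours. The sketch below illustrates only that rule, assuming the graph is a dictionary mapping each vertex to its set of neighbours. The function names and the label-based tie-breaking are choices made for this example, and the thesis's combination with α-redundant vertices is not reproduced.

```python
def greedy_independent_set(adj):
    """Minimum-degree greedy heuristic for an independent set.

    `adj` maps each vertex to the set of its neighbours (undirected graph).
    Ties between vertices of equal degree are broken by vertex label.
    """
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    chosen = []
    while remaining:
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        chosen.append(v)
        removed = remaining[v] | {v}       # v and all of its neighbours
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():    # drop edges to deleted vertices
            nbrs -= removed
    return chosen


def is_independent(adj, vertices):
    """Check that no two of the given vertices are adjacent."""
    chosen = set(vertices)
    return all(not (adj[v] & chosen) for v in chosen)


# Path graph 1 - 2 - 3 - 4 - 5
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
solution = greedy_independent_set(path)
print(solution, is_independent(path, solution))
```

The heuristic always returns a maximal independent set, but in general it carries no guarantee of being a maximum one, which is why its behaviour on restricted graph classes is of interest.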
34	Fixed cardinality linear ordering problem, polyhedral studies and solution methods / Problème d'ordre linéaire sous containte de cardinalité, étude polyédrale et méthodes de résolution Neamatian Monemi, Rahimeh 02 December 2014 (has links) Le problème d’ordre linéaire (LOP) a reçu beaucoup d’attention dans différents domaines d’application, allant de l’archéologie à l’ordonnancement en passant par l’économie et même de la psychologie mathématique. Ce problème est aussi connu pour être parmi les problèmes NP-difficiles. Nous considérons dans cette thèse une variante de (LOP) sous contrainte de cardinalité. Nous cherchons donc un ordre linéaire d’un sous-ensemble de sommets du graphe de préférences de cardinalité fixée et de poids maximum. Ce problème, appelé (FCLOP) pour ’fixed-cardinality linear ordering problem’, n’a pas été étudié en tant que tel dans la littérature scientifique même si plusieurs applications dans les domaines de macro-économie, de classification dominante ou de transport maritime existent concrètement. On retrouve en fait ses caractéristiques dans les modèles étendus de sous-graphes acycliques. Le problème d’ordre linéaire est déjà connu comme un problème NP-difficile et il a donné lieu à de nombreuses études, tant théoriques sur la structure polyédrale de l’ensemble des solutions réalisables en variables 0-1 que numériques grâce à des techniques de relaxation et de séparation progressive. Cependant on voit qu’il existe de nombreux cas dans la littérature, dans lesquelles des solveurs de Programmation Linéaire en nombres entiers comme CPLEX peuvent en résoudre certaines instances en moins de 10 secondes, mais une fois que la cardinalité est limitée, ces mêmes instances deviennent très difficiles à résoudre. Sur les aspects polyédraux, nous avons étudié le polytope de FCLOP, défini plusieurs classes d’inégalités valides et identifié la dimension ainsi que certaines inégalités qui définissent des facettes pour le polytope de FCLOP. 
Nous avons introduit un algorithme Relax-and-Cut basé sur ces résultats pour résoudre les instances du problème. Dans cette étude, nous nous sommes également concentrés sur la relaxation Lagrangienne pour résoudre ces cas difficiles. Nous avons étudié différentes stratégies de relaxation et nous avons comparé les bornes duales par rapport à la consolidation obtenue à partir de chaque stratégie de relâcher les contraintes afin de détecter le sous-ensemble des contraintes le plus approprié. Les résultats numériques montrent que nous pouvons trouver des bornes duales de très haute qualité. Nous avons également mis en place une méthode de décomposition Lagrangienne. Dans ce but, nous avons décomposé le modèle de FCLOP en trois sous-problèmes (au lieu de seulement deux) associés aux contraintes de ’tournoi’, de ’graphes sans circuits’ et de ’cardinalité’. Les résultats numériques montrent une amélioration significative de la qualité des bornes duales pour plusieurs cas. Nous avons aussi mis en oeuvre une méthode de plans sécants (cutting plane algorithm) basée sur la relaxation pure des contraintes de circuits. Dans cette méthode, on a relâché une partie des contraintes et on les a ajoutées au modèle au cas où il y a des violations. Les résultats numériques montrent des performances prometteuses quant à la réduction du temps de calcul et à la résolution d’instances difficiles hors d’atteinte des solveurs classiques en PLNE. / The Linear Ordering Problem (LOP) has received significant attention in different areas of application, ranging from transportation and scheduling to economics and even archeology and mathematical psychology. It is classified as an NP-hard problem. Assume a complete weighted directed graph on the vertex set $V_n$, with $|V_n| = n$. A permutation of the elements of this finite set of vertices is a linear order. Now let p be a given fixed integer, 0 ≤ p ≤ n. 
The p-Fixed Cardinality Linear Ordering Problem (FCLOP) seeks a subset S of p vertices and a linear order on the nodes in S. Graphically, there exists exactly one directed arc between every pair of vertices in an LOP feasible solution, which is therefore a complete cycle-free digraph, and the objective is to maximize the sum of the weights of all the arcs in a feasible solution. In the FCLOP, we are looking for a subset $S \subseteq V_n$ such that $|S| = p$ and a linear order on the nodes of S. Hence the objective is to find the best subset of nodes and a linear order over these p nodes that maximize the sum of the weights of all the arcs in the solution. Graphically, a feasible solution of the FCLOP is a complete cycle-free digraph on S plus a set of n − p vertices that are not connected to any of the other vertices. Several studies in the literature focus on polyhedral aspects of the linear ordering problem as well as on various exact and heuristic solution methods. The fixed cardinality linear ordering problem is presented for the first time in this PhD study; as far as we know, no other study in the literature has addressed it. The linear ordering problem is already known to be NP-hard. However, many instances in the literature can be solved by CPLEX in less than 10 seconds (when p = n), but once the cardinality is limited to p (p < n), the same instance can no longer be solved because of memory limitations. We have studied the polytope corresponding to the FCLOP for different cardinality values. We have identified the dimension of the polytope, proposed several classes of valid inequalities, and shown that some of these valid inequalities define facets of the FCLOP polytope for different cardinality values. We have then introduced a Relax-and-Cut algorithm based on these results to solve instances of the FCLOP. 
To solve instances of the problem, we first applied a Lagrangian relaxation algorithm. We studied different relaxation strategies and compared the dual bounds obtained in each case to detect the most suitable subproblem. Numerical results show that some of the relaxation strategies yield better dual bounds, while others contribute more to reducing the computational time and provide a relatively good dual bound in a shorter time. We have also implemented a Lagrangian decomposition algorithm, decomposing the FCLOP model into three subproblems (instead of only two). The interest of decomposing the FCLOP model into three subproblems comes mostly from the nature of these subproblems, which are considerably easier to solve than the initial FCLOP model. Numerical results show a significant improvement in the quality of dual bounds for several instances. We could also obtain considerably better dual bounds in a shorter time compared to the other relaxation strategies. We have proposed a cutting plane algorithm based on the pure relaxation strategy. In this algorithm, we first relax a subset of constraints of which, owing to the problem structure, only very few are active. Then, in the course of the branch-and-bound tree, we check whether any of the relaxed constraints is violated. The violated constraints identified in this way are then added globally to the model. (...) 
Optimisation Recherche opérationnelle NP-difficile Programmation mathématiques Programmation nombres entiers Méthodes de résolution exacte Polyèdres Dimension de polytope Facette Relaxation Lagrange Décomposition Lagrange Décomposition Benders Méthode de relax-and-cut Méthode de plans sécants Méthode de Bundle Borne duale Optimization Operations research NP-hard Mathematical programming Integer programming Exact solution method Polyhedral Polytope dimension Facet defining inequalities Lagrangian relaxation Lagrangian decomposition Benders decomposition Relax-and-cut Cutting plane algorithm Bundle method Dual bound
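The FCLOP objective described above (choose p of the n vertices and a linear order on them so that the total weight of the arcs pointing forward in the order is maximal) can be made concrete with an exhaustive search. The sketch below is only an illustration for tiny instances, assuming the weights are given as an n × n matrix `w` where `w[i][j]` is the weight of arc (i, j). The function names and the example matrix are invented for this example, and none of the polyhedral or Lagrangian methods of the thesis are reproduced.

```python
from itertools import combinations, permutations


def order_weight(order, w):
    """Total weight of arcs (i, j) with i placed before j in `order`."""
    return sum(w[i][j] for i, j in combinations(order, 2))


def solve_fclop_bruteforce(w, p):
    """Enumerate every ordered selection of p vertices and keep the best.

    The number of candidates is n! / (n - p)!, so this is usable only
    for very small n; it is meant to illustrate the objective.
    """
    n = len(w)
    if not 0 <= p <= n:
        raise ValueError("p must satisfy 0 <= p <= n")
    return max(permutations(range(n), p), key=lambda order: order_weight(order, w))


# Hypothetical 3-vertex instance: w[i][j] is the weight of arc i -> j.
w = [
    [0, 5, 1],
    [2, 0, 4],
    [3, 1, 0],
]
for p in (2, 3):
    best = solve_fclop_bruteforce(w, p)
    print(p, best, order_weight(best, w))
```

With p = n the search reduces to the ordinary linear ordering problem; for p < n the choice of subset and the choice of order are made jointly, which is the coupling that the cardinality constraint introduces.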
35	Algorithms for the Maximum Independent Set Problem Lê, Ngoc C. 18 February 2015 (has links) This thesis focuses mainly on the Maximum Independent Set (MIS) problem. Some related graph-theoretical combinatorial problems are also considered. As these problems are generally NP-hard, we study their complexity in hereditary graph classes, i.e. graph classes defined by a set F of forbidden induced subgraphs. We review the literature on the problem, including complexity results, applications, and techniques for tackling it. By considering some general approaches, we exhibit several cases where the problem admits a polynomial-time solution. More specifically, we present polynomial-time algorithms for the MIS problem in: + some subclasses of $S_{2;j;k}$-free graphs (thus generalizing the classical result for $S_{1;2;k}$-free graphs); + some subclasses of $tree_{k}$-free graphs (thus generalizing the classical results for subclasses of $P_{5}$-free graphs); + some subclasses of $P_{7}$-free graphs and $S_{2;2;2}$-free graphs; and + various subclasses of graphs of bounded maximum degree, for example subcubic graphs. Our algorithms are based on various approaches. In particular, we characterize augmenting graphs in a subclass of $S_{2;k;k}$-free graphs and a subclass of $S_{2;2;5}$-free graphs. These characterizations are partly based on extensions of the concept of redundant set [125]. We also propose methods for finding augmenting chains, an extension of the method in [99], and for finding augmenting trees, an extension of the methods in [125]. We apply the augmenting vertex technique, originally used for $P_{5}$-free graphs or banner-free graphs, to some more general graph classes. We consider a general graph-theoretical combinatorial problem, the so-called Maximum Π-Set problem. Two special cases of this problem, the so-called Maximum F-(Strongly) Independent Subgraph and Maximum F-Induced Subgraph problems, where F is a set of connected graphs, are considered. 
The complexity of the Maximum F-(Strongly) Independent Subgraph problem is reviewed and the NP-hardness of the Maximum F-Induced Subgraph problem is proved. We also extend the augmenting approach so that it applies to the general Maximum Π-Set problem. We review classical graph transformations and give two unified views based on pseudo-Boolean functions and α-redundant vertices. We also make extensive use of α-redundant vertices, originally used mainly for $P_{5}$-free graphs, to give polynomial solutions for some subclasses of $S_{2;2;2}$-free graphs and $tree_{k}$-free graphs. We consider some classical sequential greedy heuristic methods. We also combine classical algorithms with α-redundant vertices to obtain new strategies for choosing the next vertex in greedy methods. Some aspects of the algorithms, for example forbidden induced subgraph sets and worst-case results, are also considered. Finally, we restrict our attention to graphs of bounded maximum degree and subcubic graphs. Then, using techniques such as α-redundant vertices, clique separators, and distance-based arguments, we generalize these results to some subclasses of $S_{i;j;k}$-free subcubic graphs. info:eu-repo/classification/ddc/510 ddc:510 Stabile-Mengen-Problem
