Global ETD Search

1	Data sampling strategies in stochastic algorithms for empirical risk minimization Csiba, Dominik January 2018 (has links) Gradient descent methods, and especially their stochastic variants, have become highly popular in the last decade due to their efficiency on big data optimization problems. In this thesis we present the development of data sampling strategies for these methods. In the first four chapters we focus on four views on sampling for convex problems, developing and analyzing new state-of-the-art methods using non-standard data sampling strategies. Finally, in the last chapter we present a more flexible framework, which generalizes to more problems as well as more sampling rules. In the first chapter we propose AdaSDCA, an adaptive variant of stochastic dual coordinate ascent (SDCA) for solving the regularized empirical risk minimization (ERM) problem. Our modification consists in allowing the method to adaptively change the probability distribution over the dual variables throughout the iterative process. AdaSDCA achieves a provably better complexity bound than SDCA with the best fixed probability distribution, known as importance sampling. However, it is of a theoretical character as it is expensive to implement. We also propose AdaSDCA+: a practical variant which in our experiments outperforms existing non-adaptive methods. In the second chapter we extend the dual-free analysis of SDCA to arbitrary mini-batching schemes. Our method is able to better utilize the information in the data defining the ERM problem. For convex loss functions, our complexity results match those of QUARTZ, which is a primal-dual method also allowing for arbitrary mini-batching schemes. The advantage of a dual-free analysis comes from the fact that it guarantees convergence even for non-convex loss functions, as long as the average loss is convex. We illustrate through experiments the utility of being able to design arbitrary mini-batching schemes. 
In the third chapter we study importance sampling of minibatches. Minibatching is a well-studied and highly popular technique in supervised learning, used by practitioners due to its ability to accelerate training through better utilization of parallel processing power and reduction of stochastic variance. Another popular technique is importance sampling, a strategy for preferential sampling of more important examples that is also capable of accelerating the training process. However, despite considerable effort by the community in these areas, and due to the inherent technical difficulty of the problem, there is no existing work combining the power of importance sampling with the strength of minibatching. In this chapter we propose the first importance sampling for minibatches and give a simple and rigorous complexity analysis of its performance. We illustrate on synthetic problems that for training data of certain properties, our sampling can lead to several orders of magnitude improvement in training time. We then test the new sampling on several popular datasets, and show that the improvement can reach an order of magnitude. In the fourth chapter we ask whether randomized coordinate descent (RCD) methods should be applied to the ERM problem or rather to its dual. When the number of examples (n) is much larger than the number of features (d), a common strategy is to apply RCD to the dual problem. On the other hand, when the number of features is much larger than the number of examples, it makes sense to apply RCD directly to the primal problem. In this chapter we provide the first joint study of these two approaches when applied to L2-regularized ERM. First, we show through a rigorous analysis that for dense data, the above intuition is precisely correct. However, we find that for sparse and structured data, primal RCD can significantly outperform dual RCD even if d ≪ n, and vice versa, dual RCD can be much faster than primal RCD even if n ≪ d. 
Moreover, we show that, surprisingly, a single sampling strategy minimizes both the (bound on the) number of iterations and the overall expected complexity of RCD. Note that the latter complexity measure also takes into account the average cost of the iterations, which depends on the structure and sparsity of the data, and on the sampling strategy employed. We confirm our theoretical predictions using extensive experiments with both synthetic and real data sets. In the last chapter we introduce two novel generalizations of the theory for gradient descent type methods in the proximal setting. Firstly, we introduce the proportion function, which we further use to analyze all the known block-selection rules for coordinate descent methods under a single framework. This framework includes randomized methods with uniform, non-uniform or even adaptive sampling strategies, as well as deterministic methods with batch, greedy or cyclic selection rules. We additionally introduce a novel block selection technique called greedy minibatches, for which we provide competitive convergence guarantees. Secondly, the whole theory of strongly-convex optimization was recently generalized to a specific class of non-convex functions satisfying the so-called Polyak-Łojasiewicz condition. To mirror this generalization in the weakly convex case, we introduce the Weak Polyak-Łojasiewicz condition, using which we give global convergence guarantees for a class of non-convex functions previously not considered in theory. Additionally, we give local convergence guarantees for an even larger class of non-convex functions satisfying only a certain smoothness assumption. By combining the two above-mentioned generalizations we recover the state-of-the-art convergence guarantees for a large class of previously known methods and setups as special cases of our framework. 
Also, we provide new guarantees for many previously not considered combinations of methods and setups, as well as a huge class of novel non-convex objectives. The flexibility of our approach offers a lot of potential for future research, as any new block selection procedure will have a convergence guarantee for all objectives considered in our framework, while any new objective analyzed under our approach will have a whole fleet of block selection rules with convergence guarantees readily available.
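As a rough illustration of the importance sampling idea mentioned in the abstract above, the following Python sketch runs plain SGD on an L2-regularized least-squares problem, drawing example i with probability proportional to the smoothness constant of its loss and reweighting the gradient so that it stays unbiased. This is not the thesis's AdaSDCA or minibatch sampling: the function names, the choice of squared loss, and the step size are assumptions made only for this example.

```python
import numpy as np


def importance_probabilities(X, lam):
    """Probabilities proportional to per-example smoothness constants.

    Assumes the loss 0.5 * (x_i @ w - y_i)**2 + 0.5 * lam * ||w||^2, whose
    gradient is Lipschitz with constant ||x_i||^2 + lam.
    """
    L = np.sum(X * X, axis=1) + lam
    return L / L.sum()


def reweighted_gradient(X, y, w, lam, i, probs):
    """Gradient of example i, scaled by 1 / (n * p_i).

    The scaling makes the expectation over i ~ probs equal to the gradient
    of the average regularized loss.
    """
    n = X.shape[0]
    grad_i = (X[i] @ w - y[i]) * X[i] + lam * w
    return grad_i / (n * probs[i])


def sgd_ridge(X, y, lam, probs, steps, step_size, seed=0):
    """SGD with a fixed (non-adaptive) sampling distribution `probs`."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.choice(n, p=probs)  # sample one example index
        w -= step_size * reweighted_gradient(X, y, w, lam, i, probs)
    return w
```

Passing `np.full(n, 1.0 / n)` as `probs` recovers uniform sampling, which makes it easy to compare the two strategies on the same data.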
2	New combinatorial features of knots and virtual knots Mortier, Arnaud 12 July 2013 (has links) (PDF) A knot is an embedding of the circle into a 3-dimensional manifold. In the sphere S³, knots can be encoded combinatorially by Gauss diagrams. These can be studied in their own right, forgetting the actual knots: this is what is called virtual knot theory. In the first part we define a general version of virtual knots, depending on a group G equipped with a homomorphism to Z/2. When these parameters are well chosen, the resulting theory generalizes knots in an arbitrary thickened surface (that is, a real line bundle over a surface). Besides encoding knots, Gauss diagrams are also a powerful tool for describing Vassiliev's finite-type invariants. In the second part, we give a complete set of criteria for detecting these invariants. In particular, the criterion for invariance under Reidemeister III gives a positive answer to a conjecture of M. Polyak. The examples given include a new proof and a generalization of the Grishanov-Vassiliev theorem on planar chain invariants. The third part is a draft plan aimed at finding an algorithm to decide whether a given diagram in the annulus R × S¹ represents a closed braid in the solid torus, up to isotopy. The first step is completed: it consists in finding a criterion that recognizes the Gauss diagrams of closed braids. We conjecture that this criterion suffices for diagrams with a minimal number of crossings, and suggest directions toward this goal. The last part is joint work with T. Fiedler, exploring the properties of non-generic objects related to the space of all immersions of the circle into R³. This space is infinite-dimensional, stratified by the degree of non-genericity of the immersions. 
Whereas Vassiliev theory confines itself to the study of strata containing only ordinary double points, here we forbid these double points and allow only a certain type of triple point. We show that the resulting space is not simply connected by exhibiting a non-trivial 1-cocycle. A weighting of this 1-cocycle yields a new formula for the Casson invariant of knots. Virtual knots; Gauss diagrams; Polyak algebra; finite-type invariants; braids; Casson invariant
3	Understanding and Accelerating the Optimization of Modern Machine Learning Liu, Chaoyue January 2021 (has links) No description available. Artificial Intelligence; Computer Science; Deep learning; Neural networks; optimization; acceleration; gradient descent; SGD; neural tangent kernel; transition to linearity; Polyak-Łojasiewicz condition; momentum
