
Formal guarantees for heuristic optimization algorithms used in machine learning

Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for large-scale optimization in machine learning. A variety of strategies have been proposed for tuning the step sizes, ranging from adaptive step sizes (e.g., AdaGrad) to heuristic schedules that change the step size at each iteration. Momentum is also widely employed in machine learning tasks to accelerate training. Yet our theoretical understanding of these methods remains incomplete. In this work, we start to close this gap by providing formal guarantees for a few heuristic optimization methods and proposing improved algorithms where the existing theoretical results are suboptimal.
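As a baseline for the methods discussed throughout, plain SGD with heavy-ball momentum can be sketched as follows. The quadratic objective and the constants (`lr=0.1`, `beta=0.9`) are illustrative choices only, not parameters from the thesis:

```python
import math

def sgd_momentum_step(x, m, grad, lr=0.1, beta=0.9):
    """One SGD step with heavy-ball momentum: m <- beta*m + g, x <- x - lr*m."""
    m = beta * m + grad
    x = x - lr * m
    return x, m

# Minimize f(x) = 0.5 * x**2 (whose gradient is x itself), starting from x = 2.0.
x, m = 2.0, 0.0
for _ in range(200):
    x, m = sgd_momentum_step(x, m, grad=x)
```

On this toy quadratic the iterates spiral toward the minimizer at zero; the momentum buffer lets the method take larger effective steps than plain gradient descent with the same `lr`.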

First, we analyze a generalized version of the AdaGrad step sizes, Delayed AdaGrad, in both convex and non-convex settings, showing that these step sizes allow the algorithms to adapt automatically to the noise level of the stochastic gradients. We give sufficient conditions for Delayed AdaGrad to achieve almost sure convergence of the gradients to zero, the first such guarantee for Delayed AdaGrad in the non-convex setting. Moreover, we present a high-probability analysis for Delayed AdaGrad and its momentum variant in the non-convex setting.
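A Delayed-AdaGrad-style step size can be sketched in a few lines. The defining feature is that the step size at iteration t is built only from gradients seen *before* iteration t; the single global step size and the constants `eta` and `b` below are illustrative, and the thesis's exact formulation may differ:

```python
import math

def delayed_adagrad(grad_fn, x0, eta=0.5, b=1.0, steps=300):
    """SGD with a Delayed-AdaGrad-style global step size: the accumulator
    holds b^2 plus the squared norms of *past* gradients only, so the
    step size at iteration t does not depend on the current gradient."""
    x = list(x0)
    acc = b * b
    for _ in range(steps):
        g = grad_fn(x)
        step = eta / math.sqrt(acc)         # uses only gradients before t
        x = [xi - step * gi for xi, gi in zip(x, g)]
        acc += sum(gi * gi for gi in g)     # g_t enters the accumulator only now
    return x

# Smoke test on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = delayed_adagrad(lambda x: x, [2.0, -1.0])
```

Because the accumulator grows only as fast as the gradients it sees, the step size shrinks quickly in the noisy regime but stays large in the near-noiseless one, which is the mechanism behind the adaptivity-to-noise results described above.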

Second, we analyze SGD with exponential and cosine step sizes, which are simple to use and empirically successful but lack theoretical support. We provide the first convergence guarantees for them in the smooth non-convex setting, with and without the Polyak-Ɓojasiewicz (PL) condition. We also show that these two schedules share the desirable property of adaptivity to noise under the PL condition.
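The two schedules can be written down directly; the constants below (`eta0`, the decay rate `alpha`, and the horizon `T`) are illustrative defaults, and the precise parameterization analyzed in the thesis may differ:

```python
import math

def exponential_lr(t, eta0=1.0, alpha=0.99):
    """Exponential schedule: eta_t = eta0 * alpha**t, with 0 < alpha < 1."""
    return eta0 * alpha ** t

def cosine_lr(t, eta0=1.0, T=100):
    """Cosine schedule: eta_t = (eta0 / 2) * (1 + cos(pi * t / T)), 0 <= t <= T."""
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / T))
```

Both decay from `eta0` toward zero, but with different shapes: the exponential schedule decays at a constant rate from the start, while the cosine schedule stays near `eta0` early on and drops steeply in the middle of training.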

Third, we study the last iterate of momentum methods. We prove the first lower bound in the general convex setting for the last iterate of SGD with constant momentum. Since this lower bound shows that constant momentum is suboptimal, we investigate a class of Follow-The-Regularized-Leader-based momentum algorithms (FTRL-based SGDM), both adaptive and non-adaptive, with increasing momentum and shrinking updates. We show that their last iterate achieves optimal convergence for unconstrained convex stochastic optimization, without projections onto bounded domains or knowledge of the number of iterations.
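One way to see "increasing momentum with shrinking updates" is the exponential-average form of momentum: with the hypothetical schedule beta_t = t/(t+1), the momentum buffer is exactly the running mean of all past gradients. This is an illustration of the general idea only, not the thesis's FTRL-based algorithms, which use their own parameter choices:

```python
import math

def sgdm_step(x, m, g, t, eta=0.5):
    """One step of SGD with time-varying momentum beta_t = t/(t+1) in
    exponential-average form, plus a shrinking step size eta/sqrt(t+1).
    (Hypothetical schedule for illustration; the thesis's FTRL-based
    algorithms use their own parameters.)"""
    beta_t = t / (t + 1.0)
    m = beta_t * m + (1.0 - beta_t) * g    # m is the running mean of g_0..g_t
    x = x - (eta / math.sqrt(t + 1.0)) * m
    return x, m

# With beta_t = t/(t+1), the momentum buffer equals the average gradient:
x, m = 0.0, 0.0
for t, g in enumerate([1.0, 2.0, 3.0]):
    x, m = sgdm_step(x, m, g, t)
# m is now (1 + 2 + 3) / 3 = 2.0
```

As t grows, beta_t approaches 1 and the per-step update shrinks, so the last iterate increasingly resembles an average of the trajectory, which is intuitively why such schemes can avoid the last-iterate penalty that constant momentum suffers.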

Identifier oai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/45053
Date 26 August 2022
Creators Li, Xiaoyu
Contributors Orabona, Francesco
Source Sets Boston University
Language en_US
Detected Language English
Type Thesis/Dissertation
