
Formal guarantees for heuristic optimization algorithms used in machine learning

Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for large-scale optimization in machine learning. A variety of strategies have been proposed for tuning the step sizes, ranging from adaptive step sizes (e.g., AdaGrad) to heuristic schedules that change the step size at each iteration. Momentum is also widely employed in machine learning tasks to accelerate training. Yet our theoretical understanding of these methods remains incomplete. In this work, we start to close this gap by providing formal guarantees for a few heuristic optimization methods and proposing improved algorithms where the existing theoretical results are suboptimal.
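As a baseline for the methods discussed throughout, plain SGD with heavy-ball momentum can be sketched as follows. The quadratic objective and the constants (`lr=0.1`, `beta=0.9`) are illustrative choices only, not parameters from the thesis:

```python
import math

def sgd_momentum_step(x, m, grad, lr=0.1, beta=0.9):
    """One SGD step with heavy-ball momentum: m <- beta*m + g, x <- x - lr*m."""
    m = beta * m + grad
    x = x - lr * m
    return x, m

# Minimize f(x) = 0.5 * x**2 (whose gradient is x itself), starting from x = 2.0.
x, m = 2.0, 0.0
for _ in range(200):
    x, m = sgd_momentum_step(x, m, grad=x)
```

On this toy quadratic the iterates spiral toward the minimizer at zero; the momentum buffer lets the method take larger effective steps than plain gradient descent with the same `lr`.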

First, we analyze a generalized version of the AdaGrad step sizes, Delayed AdaGrad, in both convex and non-convex settings, showing that these step sizes allow the algorithms to adapt automatically to the noise level of the stochastic gradients. We give sufficient conditions for Delayed AdaGrad to achieve almost sure convergence of the gradients to zero, the first such guarantee for Delayed AdaGrad in the non-convex setting. Moreover, we present a high-probability analysis for Delayed AdaGrad and its momentum variant in the non-convex setting.
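A Delayed-AdaGrad-style step size can be sketched in a few lines. The defining feature is that the step size at iteration t is built only from gradients seen *before* iteration t; the single global step size and the constants `eta` and `b` below are illustrative, and the thesis's exact formulation may differ:

```python
import math

def delayed_adagrad(grad_fn, x0, eta=0.5, b=1.0, steps=300):
    """SGD with a Delayed-AdaGrad-style global step size: the accumulator
    holds b^2 plus the squared norms of *past* gradients only, so the
    step size at iteration t does not depend on the current gradient."""
    x = list(x0)
    acc = b * b
    for _ in range(steps):
        g = grad_fn(x)
        step = eta / math.sqrt(acc)         # uses only gradients before t
        x = [xi - step * gi for xi, gi in zip(x, g)]
        acc += sum(gi * gi for gi in g)     # g_t enters the accumulator only now
    return x

# Smoke test on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = delayed_adagrad(lambda x: x, [2.0, -1.0])
```

Because the accumulator grows only as fast as the gradients it sees, the step size shrinks quickly in the noisy regime but stays large in the near-noiseless one, which is the mechanism behind the adaptivity-to-noise results described above.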

Second, we analyze SGD with exponential and cosine step sizes, which are simple to use and empirically successful but lack theoretical support. We provide the first convergence guarantees for them in the smooth non-convex setting, with and without the Polyak-Ɓojasiewicz (PL) condition. We also show that these two schedules share the desirable property of adaptivity to noise under the PL condition.
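The two schedules can be written down directly; the constants below (`eta0`, the decay rate `alpha`, and the horizon `T`) are illustrative defaults, and the precise parameterization analyzed in the thesis may differ:

```python
import math

def exponential_lr(t, eta0=1.0, alpha=0.99):
    """Exponential schedule: eta_t = eta0 * alpha**t, with 0 < alpha < 1."""
    return eta0 * alpha ** t

def cosine_lr(t, eta0=1.0, T=100):
    """Cosine schedule: eta_t = (eta0 / 2) * (1 + cos(pi * t / T)), 0 <= t <= T."""
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / T))
```

Both decay from `eta0` toward zero, but with different shapes: the exponential schedule decays at a constant rate from the start, while the cosine schedule stays near `eta0` early on and drops steeply in the middle of training.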

Third, we study the last iterate of momentum methods. We prove the first lower bound in the general convex setting for the last iterate of SGD with constant momentum. Since this lower bound shows that constant momentum is suboptimal, we investigate a class of Follow-The-Regularized-Leader-based momentum algorithms (FTRL-based SGDM), both adaptive and non-adaptive, with increasing momentum and shrinking updates. We show that their last iterate achieves optimal convergence for unconstrained convex stochastic optimization, without projections onto bounded domains or knowledge of the number of iterations.
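One way to see "increasing momentum with shrinking updates" is the exponential-average form of momentum: with the hypothetical schedule beta_t = t/(t+1), the momentum buffer is exactly the running mean of all past gradients. This is an illustration of the general idea only, not the thesis's FTRL-based algorithms, which use their own parameter choices:

```python
import math

def sgdm_step(x, m, g, t, eta=0.5):
    """One step of SGD with time-varying momentum beta_t = t/(t+1) in
    exponential-average form, plus a shrinking step size eta/sqrt(t+1).
    (Hypothetical schedule for illustration; the thesis's FTRL-based
    algorithms use their own parameters.)"""
    beta_t = t / (t + 1.0)
    m = beta_t * m + (1.0 - beta_t) * g    # m is the running mean of g_0..g_t
    x = x - (eta / math.sqrt(t + 1.0)) * m
    return x, m

# With beta_t = t/(t+1), the momentum buffer equals the average gradient:
x, m = 0.0, 0.0
for t, g in enumerate([1.0, 2.0, 3.0]):
    x, m = sgdm_step(x, m, g, t)
# m is now (1 + 2 + 3) / 3 = 2.0
```

As t grows, beta_t approaches 1 and the per-step update shrinks, so the last iterate increasingly resembles an average of the trajectory, which is intuitively why such schemes can avoid the last-iterate penalty that constant momentum suffers.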

Identifier oai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/45053
Date 26 August 2022
Creators Li, Xiaoyu
Contributors Orabona, Francesco
Source Sets Boston University
Language en_US
Detected Language English
Type Thesis/Dissertation
