Optimization Algorithms

Momentum method

SGD introduces randomness so that gradient descent may jump out of local minima and saddle points. What's more, the per-step computational complexity of SGD is lower than that of regular gradient descent.

However, because of this randomness, SGD may oscillate around the optimum. Besides, different model parameters may require different learning rates. When all model parameters share the same learning rate, it is likely that parameters with large gradients diverge while parameters with small gradients converge slowly. A smaller learning rate may solve this problem, but it will slow down the training of the whole model.

The momentum method helps us solve the problems above. It introduces a new variable $\mathbf{v}_t$ (velocity) to provide inertia for the gradient:

$$
\begin{cases}
\mathbf{v}_t=\beta\mathbf{v} _{t-1}+\mathbf{g} _t\\
\mathbf{w}_t =\mathbf{w} _{t-1}-\alpha\mathbf{v} _t
\end{cases}
$$

where $\beta$ is a hyperparameter that determines the magnitude of the inertia. When $\beta$ is $0$, the method reverts to regular gradient descent. Compared to regular gradient descent, the momentum method gives a certain weight at the current time step $t_2$ to the gradient from an earlier time step $t_1$. In the extreme case:

$$
\begin{cases}
\tau=t_2-t_1\to\infty\\
\sum\limits _{\tau=0} ^\infty\beta ^\tau\mathbf{g} _{t_1}=\frac{1}{1-\beta}\mathbf{g} _{t_1}
\end{cases}
$$

that is, the total weight of the gradient is $1/(1-\beta)$. By doing so, the magnitude of the update is preserved and gradients are more likely to descend in a better direction.
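As a quick numerical sanity check of this geometric-series weight (a tiny illustrative script; the choice $\beta=0.9$ is arbitrary):

```python
# Verify that sum_{tau=0}^{inf} beta^tau approaches 1 / (1 - beta).
beta = 0.9  # arbitrary momentum value for illustration

total = sum(beta ** tau for tau in range(1000))  # truncate the infinite sum
print(total)           # ~10.0
print(1 / (1 - beta))  # 10.0
```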

When using the momentum method with SGD, we should maintain an auxiliary variable $\mathbf{v}$. It is a vector initialized to $0$, and it should record the history across all batches and iteration rounds; namely, it is initialized only once.
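A minimal from-scratch sketch of this update rule (illustrative only; the toy quadratic loss $\|\mathbf{w}\|^2$ and the hyperparameter values are assumptions, not taken from the text):

```python
import torch

alpha, beta = 0.1, 0.9              # assumed learning rate and momentum
w = torch.tensor([5.0, -3.0])       # toy parameters
v = torch.zeros_like(w)             # velocity, initialized to 0 only once

for t in range(100):                # loop over batches / iteration rounds
    g = 2 * w                       # gradient of the toy loss ||w||^2
    v = beta * v + g                # v_t = beta * v_{t-1} + g_t
    w = w - alpha * v               # w_t = w_{t-1} - alpha * v_t

print(w)                            # should be close to the minimum at 0
```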

In PyTorch, momentum is already built into torch.optim.SGD; we only need to pass the hyperparameter $\beta$ via its momentum argument.
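For example (a minimal sketch, assuming a toy linear model and made-up data):

```python
import torch

model = torch.nn.Linear(4, 1)       # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # beta = 0.9
loss_fn = torch.nn.MSELoss()

x, y = torch.randn(32, 4), torch.randn(32, 1)   # made-up batch
loss = loss_fn(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()                    # v is kept internally in the optimizer state
```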

Adam

Adam combines several techniques (momentum, AdaGrad-style scaling, etc.) into one efficient learning algorithm. The mechanism of Adam is quite simple: on top of the momentum method, it uses the second moment of the gradient to adjust the size of the update, so as to dynamically adjust the learning rate:

$$
\begin{cases}
\mathbf{v}_t=\beta _1 \mathbf{v} _{t-1} + (1-\beta _1) \mathbf{g}_t \\
\mathbf{s}_t=\beta _2 \mathbf{s} _{t-1}+ (1-\beta _2) \mathbf{g}_t ^2
\end{cases}
$$

Similarly, in the extreme case where $\tau\to\infty$:

$$
\begin{cases}
\sum\limits _{\tau=0} ^\infty\beta _1 ^\tau(1-\beta _1)\mathbf{g} _{t_1}=\mathbf{g} _{t_1}\\
\sum\limits _{\tau=0} ^\infty\beta _2 ^\tau(1-\beta _2) \mathbf{g} _{t_1} ^2=\mathbf{g} _{t_1} ^2
\end{cases}
$$

However, when $t$ is not large enough, the sums above have not yet reached these limits, so $\mathbf{v}_t$ and $\mathbf{s}_t$ are biased toward $0$. Hence, we usually normalize (bias-correct) $\mathbf{v}_t$ and $\mathbf{s}_t$:

$$
\begin{cases}
\hat{\mathbf{v}}_t=\frac{\mathbf{v}_t}{1-\beta_1^t}\\
\hat{\mathbf{s}}_t=\frac{\mathbf{s}_t}{1-\beta_2^t}
\end{cases}
$$

Now, the gradient is:

$$
\mathbf{g}_t'=\alpha\frac{\hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t}+\epsilon}
$$

where $\epsilon$ ensures that no division-by-zero error occurs; its value is usually around $10^{-6}$.

And the gradient descent update is:

$$
\mathbf{w}_t=\mathbf{w} _{t-1}-\mathbf{g}_t'\quad\text{or}\quad\mathbf{w}_t=\mathbf{w} _{t-1}-\frac{\alpha}{\sqrt{\hat{\mathbf{s}}_t}+\epsilon}\hat{\mathbf{v}}_t
$$

where the latter form reflects the fact that the Adam algorithm dynamically adjusts the learning rate $\alpha$ per parameter and is therefore not very sensitive to the choice of learning rate, as mentioned above.
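Putting the pieces together, a from-scratch sketch of the Adam update (illustrative only; the toy quadratic loss and hyperparameter values are assumptions):

```python
import torch

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-6    # assumed hyperparameters
w = torch.tensor([5.0, -3.0])                       # toy parameters
v = torch.zeros_like(w)                             # first moment
s = torch.zeros_like(w)                             # second moment

for t in range(1, 501):                             # t starts at 1 for bias correction
    g = 2 * w                                       # gradient of the toy loss ||w||^2
    v = beta1 * v + (1 - beta1) * g                 # v_t
    s = beta2 * s + (1 - beta2) * g ** 2            # s_t
    v_hat = v / (1 - beta1 ** t)                    # bias-corrected first moment
    s_hat = s / (1 - beta2 ** t)                    # bias-corrected second moment
    w = w - alpha * v_hat / (torch.sqrt(s_hat) + eps)  # parameter update

print(w)                                            # should be close to the minimum at 0
```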

In PyTorch, we can use torch.optim.Adam to apply the Adam algorithm. The only hyperparameter we usually need to pass is $\alpha$ (lr); $\beta_1$, $\beta_2$, and $\epsilon$ have reasonable defaults.
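For example (again a minimal sketch; the toy model is an assumption):

```python
import torch

model = torch.nn.Linear(4, 1)                              # same toy model as before
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # only alpha (lr) is passed
# The training loop (zero_grad / backward / step) is the same as in the SGD example.
```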