Numerical Stability

Numerical instability

Numerical instability in a neural network is usually caused by exploding or vanishing gradients, which make the model diverge or converge very slowly. By the chain rule, for a neural network with $d$ layers, the gradient of the output of layer $d$ with respect to the output of layer $t$ (which appears in the gradient with respect to the parameters of layer $t$) is:

$$\frac{\partial h^d}{\partial h^t}=\prod_{i=t+1}^{d}\frac{\partial h^{i}}{\partial h^{i-1}}=\prod_{i=t+1}^{d}\mathrm{diag}\!\left(\sigma'(W^ih^{i-1})\right)W^i$$

where $h^{i}=\sigma(W^ih^{i-1})$ is the output of layer $i$ (and the input of layer $i+1$), $W^i$ is the weight matrix of layer $i$, and $\sigma$ is the activation function.
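As a quick numerical check, here is a minimal NumPy sketch that builds $\frac{\partial h^d}{\partial h^t}$ as the product of the per-layer factors $\mathrm{diag}(\sigma'(W^ih^{i-1}))W^i$. The depth, width, and ReLU choice are illustrative assumptions, not anything fixed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)

# Illustrative sizes: a 6-layer MLP with 8 neurons per layer.
d, n = 6, 8
W = [rng.normal(0, 1, size=(n, n)) for _ in range(d + 1)]  # W[1..d] are used

# Forward pass: h[i] = sigma(W[i] @ h[i-1]).
h = [rng.normal(0, 1, size=n)]           # h[0] is the network input
pre = [None]                              # pre[i] = W[i] @ h[i-1]
for i in range(1, d + 1):
    pre.append(W[i] @ h[i - 1])
    h.append(relu(pre[i]))

# Jacobian of h^d w.r.t. h^t as the product of per-layer factors.
t = 1
J = np.eye(n)
for i in range(d, t, -1):                 # i = d, d-1, ..., t+1
    J = J @ (np.diag(relu_grad(pre[i])) @ W[i])

print("largest |entry| of dh^d/dh^t:", np.abs(J).max())
```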

Exploding gradients

If we use ReLU as the activation:

$$\sigma(x)=\max(0,x)$$

Then every element of $\mathrm{diag}(\sigma'(W^ih^{i-1}))$ is either 1 or 0, so in the worst case the gradient reduces to a product of weight matrices:

$$\prod_{i=t+1}^{d}W^i$$

which makes the gradient entries extremely large when $d-t$ is large and causes serious problems:

  • Gradients become NaN or Inf (overflow);
  • The model becomes very sensitive to the learning rate $\alpha$, which makes training harder.
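A minimal NumPy sketch of the effect (the matrix size and depth are made up for illustration): repeatedly multiplying Gaussian weight matrices, as in the worst-case ReLU product above, quickly produces huge entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Product of (d - t) random weight matrices, as in the ReLU case above
# where every diag(sigma') factor is 0/1 and the W^i terms dominate.
n, depth = 4, 100                     # illustrative sizes
P = np.eye(n)
for _ in range(depth):
    P = P @ rng.normal(0, 1, size=(n, n))

print(np.abs(P).max())                # typically astronomically large
```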

Vanishing gradients

If we use sigmoid as the activation, $\sigma(x)=\frac{1}{1+e^{-x}}$, its derivative $\sigma'(x)=\sigma(x)(1-\sigma(x))$ is at most $1/4$ and is close to zero once $|x|$ is larger than about 4 (the function saturates). A product of many such factors is therefore likely to be very small, even zero:

  • Floating point underflow;
  • No progress in training, especially for the bottom layers, where $d-t$ is large.
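The same kind of sketch with sigmoid factors shows the opposite behaviour: since each derivative is at most $1/4$, a product of many of them shrinks towards zero (the number and scale of the pre-activations below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# sigma'(x) = sigma(x) * (1 - sigma(x)) is at most 0.25,
# so a product of many such factors decays exponentially.
x = rng.normal(0, 2, size=50)          # illustrative pre-activations
grads = sigmoid(x) * (1.0 - sigmoid(x))
print(np.prod(grads))                   # very close to 0 (may underflow)
```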

Mul to add

ResNet

A residual block computes $h^{i+1}=h^i+f(h^i)$, so each Jacobian factor becomes $I+\frac{\partial f}{\partial h^i}$ and the long product of multiplications stays close to the identity instead of blowing up or dying out. See ResNet for more details.
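A small sketch of the contrast, assuming the residual branch has a small Jacobian (the 0.1 scale, matrix size, and depth are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n, depth, eps = 4, 50, 0.1                 # illustrative sizes
plain, resid = np.eye(n), np.eye(n)
for _ in range(depth):
    J = rng.normal(0, 1, size=(n, n))
    plain = plain @ J                       # plain chain: product of Jacobians
    resid = resid @ (np.eye(n) + eps * J)   # residual chain: I + small term

print("plain:   ", np.abs(plain).max())     # grows astronomically
print("residual:", np.abs(resid).max())     # grows far more slowly
```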

LSTM

The cell state is updated additively through gates rather than through repeated matrix multiplication, which keeps gradients from vanishing over long sequences. See LSTM for more details.

Normalization

Gradient normalization
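A common form is to rescale the gradient to a fixed norm before each update, $g\leftarrow g/\lVert g\rVert_2$, so that the step size depends only on the learning rate rather than on the raw gradient magnitude.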

Gradient clipping

If the gradient norm exceeds a threshold $\theta$, rescale it as $g\leftarrow\min\!\left(1,\frac{\theta}{\lVert g\rVert}\right)g$. See Gradient clipping for more details.
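A minimal sketch of clipping by global norm; the threshold and parameter shapes below are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

# Illustrative gradients for two parameter tensors.
grads = [np.full((3, 3), 10.0), np.full(3, 10.0)]
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # <= 1.0
```

In PyTorch the equivalent one-liner is `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`, applied after `loss.backward()` and before `optimizer.step()`.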

Model parameters and activations

Another way to improve numerical stability is to keep the expectation and variance of every layer's outputs (and hence the next layer's inputs) the same across layers:

$$E(h_i^t)=0 \quad\text{and}\quad \mathrm{Var}(h_i^t)=a$$

and the expectation and variance of the gradients of each layer the same:

$$E\!\left(\frac{\partial\ell}{\partial h_i^t}\right)=0 \quad\text{and}\quad \mathrm{Var}\!\left(\frac{\partial\ell}{\partial h_i^t}\right)=b$$

We can achieve this by choosing an appropriate initial value for the weights $w$ (the bias is just an offset, so we can ignore it) and a suitable activation function.

Parameter initialization

For the weights of layer $t$ with mean $0$ and variance $\gamma_t$, we need $n_{t-1}\gamma_t=1$ to satisfy the mean-variance requirement of forward propagation (inputs and outputs) and $n_t\gamma_t=1$ to satisfy the mean-variance requirement of backward propagation (gradients), where $n_t$ is the number of neurons in layer $t$.

We cannot satisfy both conditions at once (unless $n_{t-1}=n_t$), so Xavier initialization takes their compromise:

$$\frac{\gamma_t(n_{t-1}+n_t)}{2}=1\;\Rightarrow\;\gamma_t=\frac{2}{n_{t-1}+n_t}$$

Some examples:

  • Normal: $N(0,\sigma^2)$ with $\sigma=\sqrt{2/(n_{t-1}+n_t)}$;
  • Uniform: $U\!\left(-\sqrt{6/(n_{t-1}+n_t)},\sqrt{6/(n_{t-1}+n_t)}\right)$, whose variance is also $2/(n_{t-1}+n_t)$.
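A NumPy sketch of both variants (the layer sizes are made up); in each case the variance comes out to $\gamma_t=2/(n_{t-1}+n_t)$ as derived above.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out):
    """N(0, gamma_t) with gamma_t = 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_uniform(n_in, n_out):
    """U(-a, a) with a = sqrt(6 / (n_in + n_out)); Var = a^2 / 3 = gamma_t."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

W = xavier_uniform(256, 128)            # illustrative layer sizes
print(W.var(), 2.0 / (256 + 128))       # both approximately 0.0052
```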

Choosing the activation

The activation should behave like the identity $y=x$ near the origin, which we can check with a Taylor expansion around $0$:

  • $\operatorname{sigmoid}(x)=\frac{1}{2}+\frac{x}{4}+O(x^3)$;
  • $\tanh(x)=x+O(x^3)$;
  • $\operatorname{relu}(x)=x$ for $x\ge0$.

Therefore, we usually choose relu or tanh, or rescale sigmoid to $4\times\operatorname{sigmoid}(x)-2$ so that it is approximately $y=x$ near the origin.
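A quick numerical check that tanh and the rescaled sigmoid behave like the identity near $0$ (the test points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-0.5, 0.5, 5)            # arbitrary points near the origin
print(np.tanh(x) - x)                    # small: tanh ~ y = x near 0
print(4.0 * sigmoid(x) - 2.0 - x)        # small: rescaled sigmoid ~ y = x near 0
```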