Numerical Stability
Numerical instability
A numerically unstable neural network is often caused by exploding or vanishing gradients, which make the model diverge or converge slowly. By the chain rule, for a neural network with $d$ layers, the gradient of the output of layer $d$ with respect to the output of layer $t$ (which appears in the gradient of the loss with respect to the parameters of layer $t$) is:
$$\frac{\partial h^d}{\partial h^t}=\prod_{i=t}^{d-1}\frac{\partial h^{i+1}}{\partial h^i}=\prod_{i=t+1}^{d}\mathrm{diag}\!\left(\sigma'(W^ih^{i-1})\right)W^i$$
where $h^i=\sigma(W^ih^{i-1})$ is the output of layer $i$ and the input of layer $i+1$ (the bias term is omitted for brevity).
Exploding gradients
If we use ReLU as the activation:
$$\sigma(x)=\max(0,x)$$
Then the elements of $\mathrm{diag}(\sigma'(W^ih^{i-1}))$ are either 1 or 0, so parts of the product effectively reduce to:
$$\prod_{i=t+1}^{d}W^i$$
which can make the gradients very large when $d-t$ is large and cause some serious problems (see the sketch after this list):
- Gradients become `nan` or `inf`;
- The model is very sensitive to the learning rate $\alpha$, which makes training harder.
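A minimal NumPy sketch of the blow-up: the depth and width below are assumed purely for illustration, and the matrices stand in for the $\prod W^i$ term above.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers = 100   # assumed depth d - t for illustration
width = 100        # assumed layer width

# Product of naively initialized weight matrices, as in the ReLU case
# above where diag(sigma'(.)) only contributes 0s and 1s.
M = np.eye(width)
for _ in range(num_layers):
    W = rng.normal(0.0, 1.0, size=(width, width))
    M = W @ M

print(np.abs(M).max())   # astronomically large; deeper stacks overflow to inf
```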
Vanishing gradients
If we use sigmoid as the activation, its derivative is at most $1/4$ and is nearly zero once $|x|$ is larger than about 4 (the function saturates), so the product of many such factors becomes rather small, even zero (see the sketch after this list):
- Floating point underflow;
- No progress in training, especially for the bottom layers, where $d-t$ is large.
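A matching NumPy sketch of the vanishing case; the depths are again assumed for illustration.

```python
import numpy as np

def dsigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# The derivative saturates quickly: sigma'(x) <= 0.25, and it is already
# tiny once |x| > 4.
print(dsigmoid(0.0), dsigmoid(4.0), dsigmoid(8.0))   # 0.25, ~0.018, ~0.0003

# Even in the best case (sigma' = 0.25 everywhere), stacking many layers
# shrinks the gradient geometrically toward zero.
for depth in (10, 50, 100):
    print(depth, 0.25 ** depth)
```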
Mul to add
Both fixes below turn the long product of layer Jacobians into something closer to a sum, which is numerically much better behaved.
ResNet
See ResNet for more details.
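A toy PyTorch sketch of the idea only (a real ResNet block uses convolutions and batch normalization; the layer sizes here are assumptions): the skip connection returns $x+f(x)$, so the local Jacobian is $I+\partial f/\partial x$ rather than another pure matrix product.

```python
import torch
from torch import nn

class ToyResidualBlock(nn.Module):
    """Sketch of a residual connection: output = x + f(x)."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # The "+ x" makes the local Jacobian I + df/dx, so gradients can
        # flow through the identity path instead of a long pure product.
        return x + self.f(x)

block = ToyResidualBlock(16)
out = block(torch.randn(4, 16))
```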
LSTM
See LSTM for more details.
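The same idea shows up in the LSTM cell state, whose standard update is additive rather than purely multiplicative:
$$c_t=f_t\odot c_{t-1}+i_t\odot\tilde{c}_t$$
where $f_t$ and $i_t$ are the forget and input gates and $\tilde{c}_t$ is the candidate state; the additive path lets gradients reach $c_{t-1}$ without passing through yet another weight matrix.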
Normalization
Gradient normalization
Gradient clipping
See Gradient clipping for more details.
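A minimal sketch of clipping by global norm with PyTorch's built-in `torch.nn.utils.clip_grad_norm_`; the tiny model, data, and `max_norm=1.0` are placeholders, not a recommendation.

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so that their global L2 norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```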
Model parameters and activations
Another way to improve numerical stability is to keep the expectation and variance of each layer's inputs and outputs the same across layers:
$$E(h_i^t)=0\quad\text{and}\quad\mathrm{Var}(h_i^t)=a$$
and the expectation and variance of the gradients of each layer the same:
$$E\!\left(\frac{\partial\ell}{\partial h_i^t}\right)=0\quad\text{and}\quad\mathrm{Var}\!\left(\frac{\partial\ell}{\partial h_i^t}\right)=b$$
We can achieve this by choosing an appropriate initial value for $w$ (the bias $b$ is just an offset, so we can ignore it) and a suitable activation function.
Parameter initialization
For the model parameters of layer $t$ with $E=0$ and $\mathrm{Var}=\gamma_t$, we need $n_{t-1}\gamma_t=1$ to meet the mean-variance requirement of forward propagation (inputs and outputs) and $n_t\gamma_t=1$ to meet the requirement of backward propagation (gradients), where $n_t$ is the number of neurons in layer $t$.
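To see where the forward condition comes from, assume a linear (identity) activation and i.i.d. weights $w_{ij}^t$ with zero mean and variance $\gamma_t$, independent of the inputs; then
$$h_i^t=\sum_{j=1}^{n_{t-1}}w_{ij}^t h_j^{t-1}\ \Rightarrow\ \mathrm{Var}(h_i^t)=\sum_{j=1}^{n_{t-1}}E\big((w_{ij}^t)^2\big)E\big((h_j^{t-1})^2\big)=n_{t-1}\gamma_t\,\mathrm{Var}(h_j^{t-1}),$$
so the variance is preserved exactly when $n_{t-1}\gamma_t=1$; the same argument on the backward pass gives $n_t\gamma_t=1$.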
We cannot satisfy both conditions at once (unless $n_{t-1}=n_t$), so we often use Xavier initialization, which compromises between them:
$$\frac{\gamma_t(n_{t-1}+n_t)}2=1\rightarrow\gamma_t=\frac{2}{(n_{t-1}+n_t)}$$
Some examples (a code sketch follows the list):
- Normal: $N\left(0,\sqrt{2/(n_{t-1}+n_t)}\right)$, where the second argument is the standard deviation;
- Uniform: $U\left(-\sqrt{6/(n_{t-1}+n_t)},\sqrt{6/(n_{t-1}+n_t)}\right)$.
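A short PyTorch sketch using the built-in Xavier initializers; the layer sizes ($n_{t-1}=256$, $n_t=128$) are just an example.

```python
import math
import torch
from torch import nn

layer = nn.Linear(256, 128)            # n_{t-1} = 256 inputs, n_t = 128 outputs

nn.init.xavier_normal_(layer.weight)   # N(0, sqrt(2 / (n_{t-1} + n_t)))
# nn.init.xavier_uniform_(layer.weight)  # U(-sqrt(6/(n+n)), +sqrt(6/(n+n)))
nn.init.zeros_(layer.bias)             # the offset b can simply start at 0

print(layer.weight.std())              # ~ sqrt(2 / (256 + 128)) ≈ 0.072
print(math.sqrt(2 / (256 + 128)))
```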
Choosing the activation
Near the origin the activation should behave like $y=x$, i.e., $\sigma(0)=0$ and $\sigma'(0)=1$. Taylor-expanding around $x=0$:
- $\mathrm{sigmoid}(x)=\frac{1}{2}+\frac{x}{4}+O(x^3)$
- $\tanh(x)=x+O(x^3)$
- $\mathrm{relu}(x)=x$ for $x\ge0$
Therefore, we often use $\mathrm{relu}$, $\tanh$, or the shifted and scaled sigmoid $4\times\mathrm{sigmoid}(x)-2$.
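Using the expansion above, the scaled sigmoid is indeed linear near the origin:
$$4\times\mathrm{sigmoid}(x)-2=4\left(\tfrac{1}{2}+\tfrac{x}{4}+O(x^3)\right)-2=x+O(x^3)$$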