Common RNN Models

Introduction

An RNN is a special kind of MLP that feeds its output back as input so that it can remember previous information. Its structure is so simple that it cannot remember as much information as we would like. Besides, since an RNN updates its hidden state recurrently, the cumulative chain of gradients becomes rather long, which may result in exploding or vanishing gradients.

GRU

Gated Recurrent Unit (GRU) is a method to alleviate these gradient problems in RNN. Its idea is based on three observations:

  • Some tokens have nothing to do with previous tokens.
  • Some tokens are closely related to previous tokens.
  • The current token may carry no useful information; it is better to just ignore its influence.

Hence, GRU defines two kinds of gates to capture long-term and short-term relationships:

  • Reset Gate $\mathbf{R}_t$, which captures the short-term relationship between the current token and previous tokens:
    $$\mathbf{R}_t=\sigma(\mathbf{X}_t\mathbf{W} _{xr}+\mathbf{H} _{t-1}\mathbf{W} _{hr}+\mathbf{b}_r)$$
  • Update Gate $\mathbf{Z}_t$, which captures the long-term relationship between the current token and previous tokens:
    $$\mathbf{Z}_t=\sigma(\mathbf{X}_t\mathbf{W} _{xz}+\mathbf{H} _{t-1}\mathbf{W} _{hz}+\mathbf{b}_z)$$

Both $\mathbf{R}_t$ and $\mathbf{Z}_t$ ($\sigma$ denotes the sigmoid function) take values between $0$ and $1$. $\mathbf{R}_t$ is used to generate the candidate hidden state:

$$\tilde{\mathbf{H}}_t=\tanh(\mathbf{X}_t\mathbf{W} _{xh}+(\mathbf{R}_t\odot\mathbf{H} _{t-1})\mathbf{W} _{hh}+\mathbf{b}_h)$$

As its name suggests, the reset gate $\mathbf{R}_t$ weighs the influence of the previous tokens $\mathbf{H} _{t-1}$ on the current step and combines the information of the current token with that of the previous tokens. Since $\mathbf{R}_t$ is always smaller than $1$, the candidate state pays more attention to the current token $\mathbf{X}_t$ while weakening $\mathbf{H} _{t-1}$. That's why we say that $\mathbf{R}_t$ captures the short-term relationship.

After generating the candidate hidden state, the update gate $\mathbf{Z}_t$ produces the real $\mathbf{H}_t$ from the previous state $\mathbf{H} _{t-1}$ and the current candidate state $\tilde{\mathbf{H}}_t$:

$$\mathbf{H}_t=\mathbf{Z}_t\odot\mathbf{H} _{t-1}+(1-\mathbf{Z}_t)\odot\tilde{\mathbf{H}}_t$$

When $\mathbf{Z}_t$ is close to $0$, the current token carries useful information and may be related to previous tokens (depending on $\mathbf{R}_t$), so $\mathbf{H}_t$ mostly follows $\tilde{\mathbf{H}}_t$. On the contrary, when $\mathbf{Z}_t$ is close to $1$, the current token carries little information and $\mathbf{H}_t$ mostly keeps $\mathbf{H} _{t-1}$. In other words, $\mathbf{Z}_t$ weighs $\mathbf{H} _{t-1}$ against $\tilde{\mathbf{H}}_t$. That's why we say that $\mathbf{Z}_t$ captures the long-term relationship.

Fig. 1. Computing process in GRU

Compared with RNN, GRU requires more model parameters ($\mathbf{W} _{xr}$, $\mathbf{W} _{hr}$, $\mathbf{W} _{xz}$, $\mathbf{W} _{hz}$, $\mathbf{W} _{xh}$, $\mathbf{W} _{hh}$ and the corresponding biases $\mathbf{b}$), all of which are learnable parameters in one GRU layer. The two gates help the agent properly forget current or previous information, which shortens the cumulative gradient chain to a certain degree.
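
To make the computation concrete, here is a minimal sketch of a single GRU time step written directly from the equations above. The parameter values are random and the names and sizes (`gru_step`, `num_inputs`, `num_hiddens`, `batch_size`) are made up for illustration:

```python
import torch

# Made-up sizes for illustration only
num_inputs, num_hiddens, batch_size = 8, 16, 4

def gate_params():
    # one (W_x, W_h, b) triple per gate / candidate state
    return (torch.randn(num_inputs, num_hiddens) * 0.01,
            torch.randn(num_hiddens, num_hiddens) * 0.01,
            torch.zeros(num_hiddens))

W_xr, W_hr, b_r = gate_params()  # reset gate
W_xz, W_hz, b_z = gate_params()  # update gate
W_xh, W_hh, b_h = gate_params()  # candidate hidden state

def gru_step(X_t, H_prev):
    R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)           # reset gate
    Z_t = torch.sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)           # update gate
    H_tilde = torch.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)  # candidate state
    return Z_t * H_prev + (1 - Z_t) * H_tilde                       # new hidden state

X_t = torch.randn(batch_size, num_inputs)
H_prev = torch.zeros(batch_size, num_hiddens)
H_t = gru_step(X_t, H_prev)
print(H_t.shape)  # torch.Size([4, 16])
```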

nn.GRU is the PyTorch class for GRU. Its interface is almost the same as nn.RNN's, though it has more model parameters.
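
For example (the sizes below are arbitrary), a quick check of the input and output shapes of nn.GRU:

```python
import torch
from torch import nn

gru = nn.GRU(input_size=8, hidden_size=16)  # batch_first=False by default
X = torch.randn(5, 4, 8)                    # (seq_len, batch_size, input_size)
output, h_n = gru(X)                        # output: hidden states of every step; h_n: the last one
print(output.shape, h_n.shape)              # torch.Size([5, 4, 16]) torch.Size([1, 4, 16])
```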

LSTM

Long Short-Term Memory (LSTM) is similar to GRU but more complicated.

Fig. 2. Computing process in LSTM

Compared with GRU, LSTM uses a memory cell $\mathbf{C}_t$ to record information. The role of $\mathbf{C}_t$ in LSTM is similar to that of $\mathbf{H}_t$ in GRU, but $\mathbf{C}_t$ only propagates inside the LSTM and the output of the LSTM is still $\mathbf{H}_t$. In addition, LSTM splits $\mathbf{Z}_t$ and $(1-\mathbf{Z}_t)$ into two independent gates, the forget gate $\mathbf{F}_t$ and the input gate $\mathbf{I}_t$:

$$
\begin{cases}
\mathbf{F}_t=\sigma(\mathbf{X}_t\mathbf{W} _{xf}+\mathbf{H} _{t-1}\mathbf{W} _{hf}+\mathbf{b}_f) \\
\mathbf{I}_t=\sigma(\mathbf{X}_t\mathbf{W} _{xi}+\mathbf{H} _{t-1}\mathbf{W} _{hi}+\mathbf{b}_i)
\end{cases}
$$

Similarly, LSTM uses $\tilde{\mathbf{C}}_t$ to record the relationship between the current token and previous tokens, but it does not reset the previous tokens:

$$\tilde{\mathbf{C}}_t=\tanh(\mathbf{X}_t\mathbf{W} _{xc}+\mathbf{H} _{t-1}\mathbf{W} _{hc}+\mathbf{b}_c)$$

Then, LSTM generates the memory cell $\mathbf{C}_t$ by weighing previous information $\mathbf{C} _{t-1}$ against current information $\tilde{\mathbf{C}}_t$:

$$\mathbf{C}_t=\mathbf{F}_t\odot\mathbf{C} _{t-1}+\mathbf{I}_t\odot\tilde{\mathbf{C}}_t$$

Since $\mathbf{F}_t$ and $\mathbf{I}_t$ are no longer tied to each other (unlike $\mathbf{Z}_t$ and $1-\mathbf{Z}_t$ in GRU), LSTM can combine previous information and current information more flexibly. $\mathbf{C}_t$ only flows inside the LSTM. To generate the output, LSTM uses an output gate $\mathbf{O}_t$ in place of the reset gate $\mathbf{R}_t$:

$$
\begin{cases}
\mathbf{O}_t=\sigma(\mathbf{X}_t\mathbf{W} _{xo}+\mathbf{H} _{t-1}\mathbf{W} _{ho}+\mathbf{b}_o)\\
\mathbf{H}_t=\mathbf{O}_t\odot\tanh(\mathbf{C}_t)
\end{cases}
$$

where $\tanh$ makes sure that $\mathbf{H}_t$ ranges from $-1$ to $1$ so that the agent can control the gradients better.
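
Here is a minimal sketch of a single LSTM time step following the equations above, again with random parameters and made-up names and sizes (`lstm_step`, `num_inputs`, `num_hiddens`):

```python
import torch

# Made-up sizes for illustration only
num_inputs, num_hiddens, batch_size = 8, 16, 4

def gate_params():
    # one (W_x, W_h, b) triple per gate / candidate cell
    return (torch.randn(num_inputs, num_hiddens) * 0.01,
            torch.randn(num_hiddens, num_hiddens) * 0.01,
            torch.zeros(num_hiddens))

W_xf, W_hf, b_f = gate_params()  # forget gate
W_xi, W_hi, b_i = gate_params()  # input gate
W_xc, W_hc, b_c = gate_params()  # candidate memory cell
W_xo, W_ho, b_o = gate_params()  # output gate

def lstm_step(X_t, H_prev, C_prev):
    F_t = torch.sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)   # forget gate
    I_t = torch.sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)   # input gate
    C_tilde = torch.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)  # candidate cell
    C_t = F_t * C_prev + I_t * C_tilde       # memory cell: flows only inside the LSTM
    O_t = torch.sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)   # output gate
    H_t = O_t * torch.tanh(C_t)              # hidden state: the actual output
    return H_t, C_t

X_t = torch.randn(batch_size, num_inputs)
H_prev = torch.zeros(batch_size, num_hiddens)
C_prev = torch.zeros(batch_size, num_hiddens)
H_t, C_t = lstm_step(X_t, H_prev, C_prev)
print(H_t.shape, C_t.shape)  # torch.Size([4, 16]) torch.Size([4, 16])
```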

nn.LSTM is the PyTorch class for LSTM. Its interface is almost the same as nn.GRU's, except that it also returns the cell state.
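
A small usage example (sizes are arbitrary); the second return value is a tuple containing both the last hidden state and the last cell state:

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=8, hidden_size=16)
X = torch.randn(5, 4, 8)                    # (seq_len, batch_size, input_size)
output, (h_n, c_n) = lstm(X)                # c_n is the final memory cell, kept separate from the output
print(output.shape, h_n.shape, c_n.shape)   # (5, 4, 16) (1, 4, 16) (1, 4, 16)
```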

Though LSTM is more complex, GRU often performs comparably to LSTM in practice.

DRNN

Deep Recurrent Neural Networks (DRNNs) are RNNs with multiple hidden layers: the hidden states of one recurrent layer serve as the inputs of the next layer, just like stacking layers in an MLP.
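
In PyTorch, this is just a matter of setting num_layers greater than 1. A quick sketch with arbitrary sizes:

```python
import torch
from torch import nn

deep_gru = nn.GRU(input_size=8, hidden_size=16, num_layers=3)
X = torch.randn(5, 4, 8)                    # (seq_len, batch_size, input_size)
output, h_n = deep_gru(X)
print(output.shape)  # torch.Size([5, 4, 16]): hidden states of the top layer only
print(h_n.shape)     # torch.Size([3, 4, 16]): one final hidden state per layer
```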

BRNN

Bidirectional Recurrent Neural Networks (BRNNs) are RNNs that consider not only the leftward context but also the rightward context. In some scenarios, a token is associated not only with its previous tokens but also with its future tokens, for example in contextual analysis and cloze tasks. In such cases, it is better for the agent to consider the future tokens in addition to the previous ones, and that is exactly what BRNNs do:

Fig. 3. BRNNs

BRNNs assign two independent RNN layers with the same structure to each hidden layer, where one processes the sequence from left to right and the other processes it from right to left (its recursion runs backwards in time):

$$
\begin{cases}
\overrightarrow{\mathbf{H}} _t=\text{activation}(\mathbf{X}_t\mathbf{W} _{xh} ^{(f)}+\overrightarrow{\mathbf{H}} _{t-1}\mathbf{W} _{hh} ^{(f)}+\mathbf{b} _h ^{(f)})\\
\overleftarrow{\mathbf{H}} _t=\text{activation}(\mathbf{X}_t\mathbf{W} _{xh} ^{(b)}+\overleftarrow{\mathbf{H}} _{t+1}\mathbf{W} _{hh} ^{(b)}+\mathbf{b} _h ^{(b)})
\end{cases}
$$

Then, the concatenation of $\overrightarrow{\mathbf{H}} _t$ and $\overleftarrow{\mathbf{H}} _t$, $\mathbf{H} _t=(\overrightarrow{\mathbf{H}} _t,\overleftarrow{\mathbf{H}} _t)$, is passed to the next layer as the output.

BRNNs only work when the full sequence, including future tokens, is available, so they cannot be used to predict the future. Instead, BRNNs are usually used to extract features from a sequence, just like what convolutional layers do.

There is a keyword argument bidirectional in nn.RNN, nn.GRU and nn.LSTM. To enable a BRNN, we can just set bidirectional=True.
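
A quick shape check with arbitrary sizes; note that the output feature dimension doubles because the forward and backward hidden states are concatenated:

```python
import torch
from torch import nn

bi_gru = nn.GRU(input_size=8, hidden_size=16, bidirectional=True)
X = torch.randn(5, 4, 8)                    # (seq_len, batch_size, input_size)
output, h_n = bi_gru(X)
print(output.shape)  # torch.Size([5, 4, 32]): forward and backward states concatenated
print(h_n.shape)     # torch.Size([2, 4, 16]): one final state per direction
```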