Common RNN Models

Introduction

An RNN is a special kind of MLP that feeds its hidden state back as an input so that it can remember previous information. The structure is so simple that it cannot remember as much information as we would like. Besides, since an RNN updates its hidden state recurrently, the chain of gradients accumulated through time becomes rather long, which may result in exploding or vanishing gradients.
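To make the recurrence concrete, here is a minimal sketch of one vanilla RNN step in PyTorch; the tensor names and sizes are illustrative, not from the original text.

```python
import torch

# Illustrative sizes: batch of 2 sequences, 4 input features, 8 hidden units.
batch, n_inputs, n_hidden = 2, 4, 8
W_xh = torch.randn(n_inputs, n_hidden) * 0.1   # input-to-hidden weights
W_hh = torch.randn(n_hidden, n_hidden) * 0.1   # hidden-to-hidden weights
b_h = torch.zeros(n_hidden)

def rnn_step(X_t, H_prev):
    # The hidden state from the previous step is fed back in,
    # which is what lets the network "remember" earlier tokens.
    return torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)

H = torch.zeros(batch, n_hidden)
for X_t in torch.randn(5, batch, n_inputs):    # a toy sequence of length 5
    H = rnn_step(X_t, H)
```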

GRU

Gated Recurrent Unit (GRU) is one way to mitigate the gradient problems in RNNs. Its design is based on three observations:

  • Some tokens have nothing to do with previous tokens.
  • Some tokens are closely related to previous tokens.
  • The current token may carry no useful information, in which case it is better to ignore its influence.

Hence, GRU defines two kinds of gates to capture short-term and long-term relationships:

  • Reset gate $R_t$, which captures the short-term relationship between the current token and previous tokens:
    $$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$$
  • Update gate $Z_t$, which captures the long-term relationship between the current token and previous tokens:
    $$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$$

The values of both $R_t$ and $Z_t$ ($\sigma$ denotes the sigmoid function) lie between 0 and 1. $R_t$ is used to generate the candidate hidden state:

$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$$

As its name suggests, the reset gate $R_t$ weighs how much influence the previous hidden state $H_{t-1}$ has when it is combined with the current token. Since $R_t$ is always smaller than 1, the candidate state pays more attention to the current token $X_t$ while weakening $H_{t-1}$. That is why we say $R_t$ captures the short-term relationship.

After generating the candidate hidden state, the update gate $Z_t$ produces the actual hidden state $H_t$ from the previous state $H_{t-1}$ and the current candidate $\tilde{H}_t$:

$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$$

When $Z_t$ is close to 0, the current token is informative and may be related to previous tokens (depending on $R_t$). On the contrary, when $Z_t$ is close to 1, the current token carries little information and the previous state is largely kept. $Z_t$ weighs $H_{t-1}$ against $\tilde{H}_t$, which is why we say $Z_t$ captures the long-term relationship.

Fig. 1. Computing process in GRU

Compared with a plain RNN, GRU requires more model parameters ($W_{xr}$, $W_{hr}$, $W_{xz}$, $W_{hz}$, $W_{xh}$, $W_{hh}$ and the corresponding biases $b$). All of them are learnable parameters of one GRU layer. The two gates help the model properly forget either the current or the previous information, which shortens the effective chain of accumulated gradients to a certain degree.
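To tie the formulas to the parameter list above, here is a minimal from-scratch sketch of one GRU step; all tensor names and sizes are illustrative, and the parameters are randomly initialized rather than trained.

```python
import torch

n_inputs, n_hidden = 4, 8

# Learnable parameters of one GRU layer (randomly initialized for illustration).
W_xr, W_hr, b_r = torch.randn(n_inputs, n_hidden), torch.randn(n_hidden, n_hidden), torch.zeros(n_hidden)
W_xz, W_hz, b_z = torch.randn(n_inputs, n_hidden), torch.randn(n_hidden, n_hidden), torch.zeros(n_hidden)
W_xh, W_hh, b_h = torch.randn(n_inputs, n_hidden), torch.randn(n_hidden, n_hidden), torch.zeros(n_hidden)

def gru_step(X_t, H_prev):
    R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)           # reset gate
    Z_t = torch.sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)           # update gate
    H_tilde = torch.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)  # candidate hidden state
    return Z_t * H_prev + (1 - Z_t) * H_tilde                       # new hidden state

H = torch.zeros(1, n_hidden)
H = gru_step(torch.randn(1, n_inputs), H)
```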

nn.GRU is the GRU class in PyTorch. Its interface is almost the same as nn.RNN's, though the layer has more model parameters.
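A minimal usage sketch of nn.GRU (the layer sizes here are chosen arbitrarily for illustration):

```python
import torch
from torch import nn

gru = nn.GRU(input_size=4, hidden_size=8)   # one GRU layer
X = torch.randn(5, 2, 4)                    # (seq_len, batch, input_size)
output, h_n = gru(X)                        # like nn.RNN: per-step outputs and last hidden state
print(output.shape, h_n.shape)              # torch.Size([5, 2, 8]) torch.Size([1, 2, 8])
```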

LSTM

Long Short-Term Memory (LSTM) follows the same gating idea as GRU but is more complicated.

Fig. 2. Computing process in LSTM

Compared with GRU, LSTM uses a memory cell $C_t$ to record information. The role of $C_t$ in LSTM is similar to that of $H_t$ in GRU, but $C_t$ only propagates inside the LSTM and the output of the LSTM is still $H_t$. In addition, LSTM splits the roles of $Z_t$ and $(1 - Z_t)$ into two separate gates, the forget gate $F_t$ and the input gate $I_t$:

$$\begin{cases} F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f) \\ I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i) \end{cases}$$

Similarly, LSTM uses $\tilde{C}_t$ to record the relationship between the current token and previous tokens, but it does not reset the previous state:

$$\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)$$

Then, LSTM generates the memory cell $C_t$ by weighing the previous information $C_{t-1}$ against the current information $\tilde{C}_t$:

$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$$

Since $F_t$ and $I_t$ are no longer tied to each other (unlike $Z_t$ and $1 - Z_t$ in GRU), LSTM can combine previous information and current information more flexibly. $C_t$ only flows inside the LSTM. To generate the output, LSTM uses an output gate $O_t$ in place of the reset gate $R_t$:

$$\begin{cases} O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o) \\ H_t = O_t \odot \tanh(C_t) \end{cases}$$

where $\tanh$ makes sure that $H_t$ ranges from $-1$ to $1$ so that the model can keep gradients under control.
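Putting the LSTM equations together, here is a minimal from-scratch sketch of one step; the parameter names mirror the formulas, and all sizes are illustrative.

```python
import torch

n_inputs, n_hidden = 4, 8

def make_params():
    # (input-to-hidden weights, hidden-to-hidden weights, bias) for one gate
    return torch.randn(n_inputs, n_hidden), torch.randn(n_hidden, n_hidden), torch.zeros(n_hidden)

(W_xf, W_hf, b_f), (W_xi, W_hi, b_i) = make_params(), make_params()
(W_xo, W_ho, b_o), (W_xc, W_hc, b_c) = make_params(), make_params()

def lstm_step(X_t, H_prev, C_prev):
    F_t = torch.sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)     # forget gate
    I_t = torch.sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)     # input gate
    O_t = torch.sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)     # output gate
    C_tilde = torch.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)    # candidate memory
    C_t = F_t * C_prev + I_t * C_tilde                        # memory cell (stays inside the LSTM)
    H_t = O_t * torch.tanh(C_t)                               # hidden state (the output)
    return H_t, C_t

H, C = torch.zeros(1, n_hidden), torch.zeros(1, n_hidden)
H, C = lstm_step(torch.randn(1, n_inputs), H, C)
```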

nn.LSTM is the LSTM class in PyTorch. Its interface is almost the same as nn.GRU's.
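A minimal usage sketch of nn.LSTM (sizes chosen arbitrarily); the one visible difference from nn.GRU is that the memory cell is returned alongside the hidden state:

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=4, hidden_size=8)
X = torch.randn(5, 2, 4)                    # (seq_len, batch, input_size)
output, (h_n, c_n) = lstm(X)                # the memory cell c_n is returned as well
print(output.shape, h_n.shape, c_n.shape)   # [5, 2, 8], [1, 2, 8], [1, 2, 8]
```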

Though LSTM is more complex, GRU usually performs about as well as LSTM in practice.

DRNN

Deep Recurrent Neural Networks (DRNNs) are RNNs with multiple stacked hidden layers. The stacking works just as in an MLP: the hidden states of one recurrent layer are the inputs of the next.
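In PyTorch, stacking is exposed through the num_layers argument; a minimal sketch with illustrative sizes:

```python
import torch
from torch import nn

# Two stacked GRU layers: the hidden states of layer 1 are the inputs of layer 2.
deep_gru = nn.GRU(input_size=4, hidden_size=8, num_layers=2)
X = torch.randn(5, 2, 4)
output, h_n = deep_gru(X)
print(output.shape, h_n.shape)   # [5, 2, 8] and [2, 2, 8]: one final state per layer
```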

BRNN

Bidirectional Recurrent Neural Networks (BRNNs) are RNNs that consider not only the leftward context but also the rightward context. In some scenarios, a token is associated not only with its previous tokens but also with its future tokens, for example in contextual analysis and cloze tasks. In this case, it is better for the model to consider the future tokens in addition to the previous ones, and that is what BRNNs do:

Fig. 3. BRNNs

BRNNs assign two independent RNN layers with the same structure to each hidden layer: one deals with the leftward context and the other deals with the rightward context (it simply takes the reversed sequence as input):

$$\begin{cases} \overrightarrow{H}_t = \text{activation}(X_t W_{xh}^{(f)} + \overrightarrow{H}_{t-1} W_{hh}^{(f)} + b_h^{(f)}) \\ \overleftarrow{H}_t = \text{activation}(X_t W_{xh}^{(b)} + \overleftarrow{H}_{t+1} W_{hh}^{(b)} + b_h^{(b)}) \end{cases}$$

Then, the concatenation of $\overrightarrow{H}_t$ and $\overleftarrow{H}_t$ is passed to the next layer as the output $H_t = (\overrightarrow{H}_t, \overleftarrow{H}_t)$.

BRNNs only work when the future part of the sequence is available, so they cannot be used to predict the future. Instead, BRNNs are usually used to extract features from a whole sequence, much like what convolutional layers do.

There is a keyword argument bidirectional in nn.RNN, nn.GRU and nn.LSTM. To enable a BRNN, we can simply set bidirectional=True.
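A minimal sketch showing bidirectional=True and the concatenated output (sizes are illustrative):

```python
import torch
from torch import nn

bi_gru = nn.GRU(input_size=4, hidden_size=8, bidirectional=True)
X = torch.randn(5, 2, 4)
output, h_n = bi_gru(X)
# Forward and backward hidden states are concatenated along the feature dimension,
# so the output feature size is 2 * hidden_size.
print(output.shape, h_n.shape)   # [5, 2, 16] and [2, 2, 8]
```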