RNN: a Special Kind of MLP

Sequence model & Language model

Sequence information is data arranged in a certain order, whose defining feature is contextual correlation. Language, or text, is a typical example of sequence information. When dealing with sequence information, what the neural network does is predict the future based on the history, that is:

$$x_t\sim P(x_t|x_1,...,x_{t-1})$$

where $x_t$ is correlated with its history $(x_1,...,x_{t-1})$. Similarly, if we want to predict a whole new sequence, we model:

$$(x_1,...,x_T)\sim \prod\limits_{t=1}^TP(x_t|x_1,...,x_{t-1})$$

Such a model, or such a relationship, is called a sequence model. In a sequence model, the data points are not independent but sequential: we use previous data to predict future data. However, when the sequence is extremely long, the amount of previous data becomes rather large. In general, there are two methods to cope with this.

Markov assumption

The Markov assumption (or Markov model) assumes that the future $x_t$ is only correlated with a small span of the past $(x_{t-\tau},...,x_{t-1})$, where $\tau$ is the span. In this case:

$$x_t\sim P(x_t|x_{t-\tau},...,x_{t-1})$$

$\tau$ is an important hyperparameter that determines the complexity of prediction. When $\tau=m$, the model is called an $m$th-order Markov model:

$$(x_1,...,x_T)\sim \prod\limits_{t=1}^TP(x_t|x_{\max(t-\tau,0)},...,x_{t-1})$$

where $x_0$ is a placeholder that has nothing to do with the other $x_i$, i.e. it is independent of them. Such a model is also called an autoregressive model.
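As a minimal sketch (the toy series and the choice $\tau=4$ below are purely illustrative), the Markov assumption turns sequence prediction into ordinary supervised learning by pairing each window of $\tau$ past values with the value that follows it:

```python
import torch

# A minimal sketch: turn a 1-D sequence into (feature, label) pairs
# under a tau-order Markov assumption. `tau` and the toy series are
# illustrative choices, not values from the text.
tau = 4
T = 1000
time = torch.arange(1, T + 1, dtype=torch.float32)
x = torch.sin(0.01 * time) + torch.normal(0, 0.2, (T,))  # noisy toy sequence

# features[i] = (x_i, ..., x_{i+tau-1}), labels[i] = x_{i+tau}
features = torch.stack([x[i: T - tau + i] for i in range(tau)], dim=1)
labels = x[tau:].reshape(-1, 1)

print(features.shape, labels.shape)  # torch.Size([996, 4]) torch.Size([996, 1])
```

Any regressor, for instance a small MLP, can then be trained on these (feature, label) pairs.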

Latent autoregressive models

In this model, we use a new variable $h_t$ to summarize the past information:

$$h_t=g(h_{t-1},x_{t-1})$$

$$\hat{x}_t=P(x_t|h_t)$$

where $h_t$ is the latent variable.

Fig. 1. Latent autoregressive models

In a word, $(x_{t-\tau},...,x_{t-1})$ is the feature of $x_t$. The aim of an RNN is to learn the mapping $x_t=f(x_{t-\tau},...,x_{t-1})$ for Markov models, or $h_t=g(h_{t-1},x_{t-1})$ and $\hat{x}_t=f(h_t)$ for latent autoregressive models.
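A minimal sketch of a latent autoregressive cell, assuming a simple learned update $g$ and prediction head $f$; all class and variable names here are illustrative, not a fixed API:

```python
import torch
from torch import nn

class LatentAR(nn.Module):
    """Minimal sketch of a latent autoregressive model (names are illustrative)."""
    def __init__(self, num_inputs, num_hiddens):
        super().__init__()
        self.g = nn.Linear(num_inputs + num_hiddens, num_hiddens)  # h_t = g(h_{t-1}, x_{t-1})
        self.f = nn.Linear(num_hiddens, num_inputs)                # x_hat_t = f(h_t)

    def forward(self, x_prev, h_prev):
        h = torch.tanh(self.g(torch.cat([x_prev, h_prev], dim=-1)))
        return self.f(h), h

model = LatentAR(num_inputs=1, num_hiddens=16)
h = torch.zeros(1, 16)   # initial latent state
x = torch.zeros(1, 1)    # previous observation
x_hat, h = model(x, h)   # one recurrence step: predict x_t and update h_t
```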

Language models

The language model is a typical sequence model. However, because language comes in the form of strings, it is hard for computers to deal with it directly. Hence, we first divide a text into tokens; a token is a string, typically a word (or character) of the original text. Then, we count the probability of occurrence of all tokens or token sequences and use Markov models to model the language model. The value of $\tau$ decides how many tokens we take into account to predict $x_t$. For instance, looking at one token at a time (so tokens are independent of each other) gives a unigram model, two tokens give a bigram model, and three tokens give a trigram model.
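A small sketch of word-level tokenization and count-based bigram estimation; the toy corpus and the probability estimate below are illustrative only:

```python
import collections

# Minimal word-level tokenization and bigram counting on a toy corpus.
corpus = "the time machine by h g wells the time traveller".split()

unigram_counts = collections.Counter(corpus)
bigram_counts = collections.Counter(zip(corpus[:-1], corpus[1:]))

# Estimate P(x_t = 'time' | x_{t-1} = 'the') from relative frequencies.
p = bigram_counts[('the', 'time')] / unigram_counts['the']
print(p)  # 1.0 in this tiny corpus: 'the' is always followed by 'time'
```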

One-hot encoding is used to turn a token into a vector so that the neural network can work with it.
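For example, assuming a hypothetical vocabulary of size 28, PyTorch's one_hot turns token indices into such vectors:

```python
import torch
import torch.nn.functional as F

vocab_size = 28                            # illustrative vocabulary size
tokens = torch.tensor([0, 2, 5])           # token indices in the vocabulary
vectors = F.one_hot(tokens, vocab_size)    # shape (3, 28), a single 1 per row
print(vectors.shape)
```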

RNN

RNNs (recurrent neural networks) are neural networks with a hidden state, which serves as an additional input to the hidden layer and is updated by calling the hidden layer recurrently.

Fig. 2. Hidden state

The hidden state $H$ plays the same role as the input $X$; that is, both are inputs of the hidden layer:

$$\text{Output}_t=H_t=\phi(X_tW_{xh}+H_{t-1}W_{hh}+b_h)$$
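The update above can be written out by hand; the shapes below are illustrative assumptions:

```python
import torch

# One hidden-state update, written out explicitly (shapes are illustrative).
batch_size, num_inputs, num_hiddens = 3, 1, 4
X_t = torch.randn(batch_size, num_inputs)
H_prev = torch.zeros(batch_size, num_hiddens)

W_xh = torch.randn(num_inputs, num_hiddens) * 0.01
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.01
b_h = torch.zeros(num_hiddens)

H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)  # phi = tanh here
print(H_t.shape)  # torch.Size([3, 4])
```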

The following figure shows the nature of the RNN better:

Fig. 3. RNN

At each time step, a token (or other sequential unit) $X_t$ enters the RNN as the input. $H_{t-1}$ records the information of the previous tokens. Together they produce $H_t$, which records the information of the current token $X_t$ as well as of the previous tokens. Hence, $H_t$ is not only the output of the hidden layer at this time step but also its input at the next time step. That is why such a network is called an RNN: for a sequence with several tokens, all the tokens are fed to the RNN in order, their relationship is recorded by $H$, and the updated $H$ is fed back to the RNN recurrently. Finally, after the last token has been processed, the prediction is made.

Fig. 4. Process of RNN

Though we apply the RNN layer several times (depending on the length of the sequence) to generate $H_1,...,H_t$ in a single iteration, it is still one single layer. Namely, during one iteration, $W_{xh}$, $W_{hh}$ and $b_h$ stay the same; what changes is $H$.
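In code, this amounts to a plain loop that reuses the same parameters at every time step (shapes again illustrative):

```python
import torch

# The recurrence: the same W_xh, W_hh, b_h are reused at every time step;
# only H changes. Shapes are illustrative.
num_steps, batch_size, num_inputs, num_hiddens = 5, 3, 1, 4
inputs = torch.randn(num_steps, batch_size, num_inputs)

W_xh = torch.randn(num_inputs, num_hiddens) * 0.01
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.01
b_h = torch.zeros(num_hiddens)

H = torch.zeros(batch_size, num_hiddens)
outputs = []
for X_t in inputs:                               # one token (per batch) at a time
    H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)  # same parameters every step
    outputs.append(H)
outputs = torch.stack(outputs)                   # (num_steps, batch_size, num_hiddens)
```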

When batches are generated by random sampling, that is, the subsequences in different batches are not contiguous (e.g. batch 1: [1, 2], batch 2: [8, 9]), $H$ must be reinitialized to 0 at every iteration. Otherwise (e.g. batch 1: [1, 2], batch 2: [3, 4]), $H$ should be carried over from the last result of the previous batch; by doing so, the current batch and the previous batch form one longer sequence. In practice, we use random sampling more often, as the text we deal with is usually too long for all of it to be remembered in $H$ alone.
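A rough sketch of the two strategies inside a hypothetical training loop (the flag and helper below are made up for illustration):

```python
import torch

# Sketch of hidden-state handling between batches (names are illustrative).
batch_size, num_hiddens = 3, 4

def init_state():
    return torch.zeros(batch_size, num_hiddens)

H = init_state()
use_random_sampling = True   # illustrative flag for the sampling strategy

for batch_idx in range(10):  # stand-in for iterating over batches
    if use_random_sampling:
        H = init_state()     # batches are not contiguous: restart from zeros
    else:
        H = H.detach()       # contiguous batches: keep H, but cut the gradient history
    # ... run the RNN on this batch starting from H ...
```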

The PyTorch API for this layer is nn.RNN.
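A minimal usage sketch of nn.RNN; the vocabulary size, hidden size, and batch shape are illustrative assumptions:

```python
import torch
from torch import nn

vocab_size, num_hiddens = 28, 256            # illustrative sizes
rnn_layer = nn.RNN(vocab_size, num_hiddens)  # input_size, hidden_size

num_steps, batch_size = 35, 32
X = torch.randn(num_steps, batch_size, vocab_size)  # e.g. one-hot (or embedded) tokens
state = torch.zeros(1, batch_size, num_hiddens)     # (num_layers, batch, hiddens)

Y, new_state = rnn_layer(X, state)
print(Y.shape)          # torch.Size([35, 32, 256]): H_t for every time step
print(new_state.shape)  # torch.Size([1, 32, 256]): the final hidden state
```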

Gradient clipping

The number of time steps represents the length of a sequence, or $\tau$ in the Markov assumption. The larger the number of time steps is, the more information the RNN needs to remember. In addition, because the operation of an RNN hidden layer over $k$ time steps is equivalent to the operation of $k$ stacked dense layers, an RNN is more likely to suffer from gradient explosion.

To solve this, what we use is gradient clipping. For all the parameters that require gradient, we put their gradient in $\mathbf{g}$ and project $\mathbf{g}$ back to a sphere of given radius $\theta$:

$$\mathbf{g}\leftarrow\min\left(1,\frac{\theta}{\|\mathbf{g}\|}\right)\mathbf{g}$$

Usually, $\theta$ is 5.
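A sketch of this projection applied to all parameters of a model; PyTorch's built-in torch.nn.utils.clip_grad_norm_ performs the same operation:

```python
import torch
from torch import nn

def grad_clipping(net, theta):
    """Project the concatenated gradient g back onto a ball of radius theta."""
    params = [p for p in net.parameters() if p.requires_grad and p.grad is not None]
    norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
    if norm > theta:
        for p in params:
            p.grad[:] *= theta / norm

# Usage inside a training loop (toy model for illustration):
net = nn.Linear(4, 2)
loss = net(torch.randn(8, 4)).pow(2).mean()
loss.backward()
grad_clipping(net, theta=5)  # call after backward(), before the optimizer step
```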

Perplexity

When tackling a language model with an RNN, what we actually do is make the agent choose the best token from the vocabulary, which is essentially a softmax regression problem. However, in NLP we do not use the average cross-entropy loss directly to measure precision. Instead, we use its exponential, the perplexity:

$$\exp(-\frac{1}{n}\sum\limits_{t=1}^n\log P(x_t|x_{t-1},...,x_1))$$

They are essentially the same, but perplexity can be read as the number of tokens the agent is effectively choosing among. Namely, when the perplexity is 1, the agent can choose a specific token without hesitation; this is the ideal result. When the perplexity is bigger than 1, there are still several tokens confusing the agent.
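A minimal sketch showing that perplexity is just the exponential of the average cross-entropy; the logits and targets below are random placeholders:

```python
import torch
from torch import nn

# Perplexity = exp(mean cross-entropy). Logits and targets are random placeholders.
vocab_size, num_predictions = 28, 100
logits = torch.randn(num_predictions, vocab_size)           # model scores per position
targets = torch.randint(0, vocab_size, (num_predictions,))  # true next tokens

ce = nn.CrossEntropyLoss()(logits, targets)  # average negative log-likelihood
perplexity = torch.exp(ce)
print(ce.item(), perplexity.item())
```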

Structure of RNN

There are many types of RNN structures, and different structures suit different tasks. For instance, in NLP, the one-to-many structure is suitable for text generation, that is, predicting the subsequent text from the current text. The many-to-one structure is suitable for text classification: it tries to remember the whole text and outputs only one value. The many-to-many structure fits machine translation and question answering; seq2seq is a typical many-to-many model.

Fig. 5. Different structures of RNN