RNN: a Special Kind of MLP
Sequence model & Language model
Sequence information is data arranged in a certain order, whose biggest feature is contextual correlation. Language, or text, is typical sequence information. When dealing with sequence information, what the neural network does is to predict the future based on the history, that is:

$$x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1),$$

where $x_t$ is the observation at time step $t$.
Such a model, or such a relationship, is called a sequence model. In a sequence model, the data is not independent but sequential: we use previous data to predict future data. However, when the sequence is extremely long, the amount of previous data becomes rather large. In general, there are two methods to cope with this.
Markov assumption
The Markov assumption (or Markov model) assumes that the future only depends on the most recent $\tau$ observations rather than on the whole history:

$$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid x_{t-1}, \ldots, x_{t-\tau}),$$

where $\tau$ is a fixed window length. When $\tau = 1$, this is a first-order Markov model.
Latent autoregressive models
In this model, we use a new variable $h_t$, a latent (hidden) state that summarizes the history, and update it step by step:

$$\hat{x}_t \sim P(x_t \mid h_t), \qquad h_t = g(h_{t-1}, x_{t-1}),$$

where $g$ is the update function. In a word, $h_t$ is the feature of $x_1, \ldots, x_{t-1}$. The aim of RNN is to obtain the mapping $P(x_t \mid x_{t-1}, \ldots, x_{t-\tau})$ for Markov models, or $g$ and $P(x_t \mid h_t)$ for latent autoregressive models.
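As a minimal sketch (the module name, sizes, and the tanh/linear choices are assumptions, not from the text), one latent autoregressive step could look like this in PyTorch:

```python
import torch
from torch import nn

class LatentAR(nn.Module):
    """Minimal latent autoregressive step: h_t = g(h_{t-1}, x_{t-1}), then predict x_t from h_t."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # g: combines the previous latent state with the previous observation
        self.update = nn.Linear(input_dim + hidden_dim, hidden_dim)
        # readout: predicts the next observation from the latent state
        self.readout = nn.Linear(hidden_dim, input_dim)

    def forward(self, x_prev, h_prev):
        h = torch.tanh(self.update(torch.cat([x_prev, h_prev], dim=-1)))
        x_hat = self.readout(h)
        return x_hat, h
```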
Language models
The language model is a typical sequence model. However, because languages are in the form of strings, it is extremely hard for computers to cope with them directly. Hence, we always divide a text into several tokens; a token is a basic unit of the original text, typically a word or a character. Then we count the probability of occurrence of all tokens or token sequences and use Markov models to model language models. The value of $\tau$ (the order of the Markov model, i.e. the $n$ of an $n$-gram) decides how many previous tokens each prediction conditions on.
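For example, a first-order Markov (bigram) language model can be estimated simply by counting; the tiny corpus and whitespace tokenizer below are made up for illustration:

```python
from collections import Counter

# Hypothetical tiny corpus and whitespace tokenization, just to illustrate
# estimating a bigram language model by counting token occurrences.
corpus = "the cat sat on the mat the cat slept".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus[:-1], corpus[1:]))

# Relative-frequency estimate of P(x_t = "cat" | x_{t-1} = "the")
p_cat_given_the = bigram[("the", "cat")] / unigram["the"]
print(p_cat_given_the)  # 2/3
```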
One-hot encoding is used to turn a token (via its index in the vocabulary) into a vector so that the neural network can work with it.
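A small sketch of this step, with a made-up vocabulary:

```python
import torch
import torch.nn.functional as F

# Hypothetical vocabulary mapping tokens to indices, then one-hot encoding.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
indices = torch.tensor([vocab["the"], vocab["cat"], vocab["sat"]])

one_hot = F.one_hot(indices, num_classes=len(vocab)).float()
print(one_hot.shape)  # torch.Size([3, 4]): one row per token, a single 1 per row
```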
RNN
RNNs (recurrent neural networks) are neural networks with a hidden state, which is an additional input of the hidden layer and is updated by calling the hidden layer recurrently.
The hidden state $H$ plays the same role as the input $X$, that is, they are both inputs of the hidden layer:

$$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h),$$

where $\phi$ is the activation function (commonly $\tanh$). The output at time step $t$ is then $O_t = H_t W_{hq} + b_q$.
The following picture shows the nature of RNN better:
At each time step, a token or other sequential unit $X_t$ is fed into the RNN layer together with the previous hidden state $H_{t-1}$. Though we apply the RNN layer several times (once per time step, depending on the length of the sequence) to generate $H_1, \ldots, H_T$ in a single iteration, it is still one layer. Namely, during one iteration, $W_{xh}$, $W_{hh}$ and $b_h$ stay the same. What changes is the hidden state $H_t$.
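A rough sketch of this sharing, with illustrative shapes and randomly initialized weights (not the text's actual training code):

```python
import torch

# Sketch of one forward pass: the same W_xh, W_hh, b_h are reused at every
# time step; only the hidden state H changes. All sizes are illustrative.
num_steps, batch_size, vocab_size, num_hiddens = 5, 2, 28, 32
X = torch.randn(num_steps, batch_size, vocab_size)

W_xh = torch.randn(vocab_size, num_hiddens) * 0.01
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.01
b_h = torch.zeros(num_hiddens)

H = torch.zeros(batch_size, num_hiddens)  # initial hidden state
outputs = []
for X_t in X:                             # one application of the layer per time step
    H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)
    outputs.append(H)
```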
When the batches are generated by random sampling, that is, the sequences of different batches are not continuous (e.g. batch 1: [1, 2], batch 2: [8, 9]), the hidden state must be reinitialized at the beginning of every batch. When the batches are sequentially partitioned, the hidden state can be carried over between batches, but it should be detached from the computational graph so that backpropagation stays within the current batch.
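A possible way to express this per-batch state handling (the helper name is assumed, not from the text):

```python
import torch

def prepare_state(state, batch_size, num_hiddens, random_sampling):
    # Hypothetical helper: reset or detach the hidden state at the start of
    # each batch depending on how the batches were sampled.
    if random_sampling or state is None:
        # Batches are not continuous, so start from a fresh state.
        return torch.zeros(batch_size, num_hiddens)
    # Sequential partitioning: keep the value but cut the old computation graph.
    return state.detach()
```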
The API of RNN in PyTorch is nn.RNN.
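A minimal usage sketch of nn.RNN with illustrative sizes:

```python
import torch
from torch import nn

# Minimal usage sketch of nn.RNN; all sizes here are illustrative.
vocab_size, num_hiddens, num_steps, batch_size = 28, 256, 35, 32

rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)

X = torch.randn(num_steps, batch_size, vocab_size)  # (num_steps, batch, input_size)
state = torch.zeros(1, batch_size, num_hiddens)     # (num_layers, batch, hidden_size)

Y, new_state = rnn_layer(X, state)
# Y: hidden states of every time step, shape (35, 32, 256).
# new_state: hidden state of the last time step, shape (1, 32, 256).
# Note: nn.RNN only computes hidden states; an output layer (e.g. nn.Linear
# mapping num_hiddens -> vocab_size) still has to be added for prediction.
print(Y.shape, new_state.shape)
```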
Gradient clipping
The number of time steps is the length of a sequence, or equivalently the number of times the RNN layer is applied within one iteration. When it is large, backpropagation through time multiplies many terms together, so the gradient can easily explode. To solve this, we use gradient clipping. For all the parameters that require gradients, we concatenate their gradients into a single vector $\mathbf{g}$ and rescale it whenever its norm exceeds a threshold $\theta$:

$$\mathbf{g} \leftarrow \min\!\left(1, \frac{\theta}{\lVert \mathbf{g} \rVert}\right) \mathbf{g}.$$

Usually, $\theta$ is a small constant such as 1.
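A sketch of gradient clipping by the global norm (the helper name is assumed; PyTorch's built-in clip_grad_norm_ does the same job):

```python
import torch

def grad_clipping(net, theta):
    # Sketch: rescale the concatenated gradient vector whenever its L2 norm
    # exceeds theta. Assumes backward() has already been called.
    params = [p for p in net.parameters() if p.requires_grad]
    norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
    if norm > theta:
        for p in params:
            p.grad[:] *= theta / norm

# PyTorch also ships an equivalent built-in:
# torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=theta)
```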
Perplexity
When coping with a language model using an RNN, what we actually do is make the agent choose the best token from the vocabulary at each time step. This is actually a softmax regression problem. However, in NLP, we don't use the average cross-entropy loss directly to measure precision. Instead, we use its exponent, perplexity:

$$\exp\!\left(-\frac{1}{n}\sum_{t=1}^{n} \log P(x_t \mid x_{t-1}, \ldots, x_1)\right).$$

They carry the same information, but perplexity shows the range of tokens that the agent is effectively choosing among. When perplexity is 1, the agent can choose a specific token without hesitation; this is the ideal result. When perplexity is bigger than 1, there are still some tokens confusing the agent.
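A quick sketch of computing perplexity from the average cross-entropy loss, with made-up logits and targets:

```python
import torch
import torch.nn.functional as F

# Sketch: perplexity is the exponent of the average cross-entropy loss.
# Shapes are illustrative: 10 predicted tokens over a 28-token vocabulary.
logits = torch.randn(10, 28)            # model outputs, one row per time step
targets = torch.randint(0, 28, (10,))   # true token indices

avg_cross_entropy = F.cross_entropy(logits, targets)  # mean over tokens
perplexity = torch.exp(avg_cross_entropy)
print(perplexity.item())
```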
Structure of RNN
There are many types of RNN structures, and different structures are suitable for different tasks. For instance, in NLP, the one-to-many structure is suitable for text generation, that is, predicting the subsequent text from the current text. The many-to-one structure is suitable for text categorization: it tries to remember the whole text and outputs only one value. The many-to-many structure fits machine translation and question answering; seq2seq is a typical many-to-many model.




