The Encoder-Decoder Architecture

Another view of ANNs

First, let's look at CNNs from the encoder-decoder perspective:

Fig. 1. View CNNs in a new perspective

When the agent extracts features from the input using convolutional and pooling layers, it encodes the image as a vector that better reflects certain characteristics of the image. After that, the agent decodes these features and generates the output.

Similarly, for RNNs, the embedding layer and the LSTM (or GRU) layers encode text as vectors. Then the agent decodes these vectors and generates the output.

Fig. 2. View RNNs in a new perspective

This is the encoder-decoder architecture. In this architecture, the encoder processes the input and passes information to the decoder, and the decoder generates the output. More generally, the information the encoder passes to the decoder is called the state, which contains features of the input. The decoder can also receive new input of its own, which gives it extra information about what it has to generate. A minimal code sketch of this wiring follows Fig. 3.

Fig. 3. Encoder-Decoder
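
The sketch below (not from the original text) shows this wiring in PyTorch; the class names `Encoder`, `Decoder`, and `EncoderDecoder` are illustrative. The encoder produces a state, and the decoder consumes that state together with its own input.

```python
from torch import nn


class Encoder(nn.Module):
    """Maps the input sequence to a state that captures its features."""
    def forward(self, X, *args):
        raise NotImplementedError


class Decoder(nn.Module):
    """Consumes the encoder's state plus its own input and generates output."""
    def init_state(self, enc_outputs, *args):
        raise NotImplementedError

    def forward(self, X, state):
        raise NotImplementedError


class EncoderDecoder(nn.Module):
    """Wires the two parts together: encode, pass the state, decode."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)
```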

Seq2seq

Sequence to sequence (Seq2seq) is a kind of neural network that uses the encoder-decoder architecture for sequence-to-sequence learning. It is applied mainly in machine translation.

Fig. 4. Machine translation and Seq2seq (English to French)

Fig. 5. Structure of Seq2seq

Encoder

The encoder is a deep RNN (DRNN) without output (regression) layers. It works like other DRNNs: it takes the source sequence (English) as input and outputs vectors. What is worth noticing is the embedding layer. It is essentially an improvement on one-hot encoding: with one-hot encoding, we have to use a $t\times m$ matrix for a sentence with $t$ tokens and a vocabulary of $m$ tokens, which is space-consuming. The embedding layer instead maps the $m$-dimensional one-hot vectors to dense $k$-dimensional vectors ($k<m$), which cuts the space needed to store a sentence.
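
A hedged sketch of such an encoder, assuming PyTorch; the hyperparameter names (`vocab_size`, `embed_size`, `num_hiddens`, `num_layers`) are illustrative:

```python
from torch import nn


class Seq2SeqEncoder(nn.Module):
    """Sketch of the encoder: an embedding layer followed by a deep GRU.
    There is no output layer; the hidden states are what gets passed on."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, dropout=0.0):
        super().__init__()
        # Maps token indices (one-hot, m-dimensional) to dense k-dimensional vectors.
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, num_hiddens, num_layers, dropout=dropout)

    def forward(self, X):
        # X: (batch_size, t) token indices.
        emb = self.embedding(X)          # (batch_size, t, embed_size)
        emb = emb.permute(1, 0, 2)       # GRU expects (t, batch_size, embed_size)
        output, state = self.rnn(emb)
        # output: (t, batch_size, num_hiddens); state: (num_layers, batch_size, num_hiddens)
        return output, state
```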

Decoder

The decoder takes the target sequence (French) $\mathbf{X}$ as part of its input and also uses an embedding layer. In addition, it takes the last hidden state of the last RNN layer of the encoder, $\mathbf{C}$, as another part of its input; $\mathbf{C}$ contains the context of the source sequence. Together, $[\mathbf{X}, \mathbf{C}]$ form the complete input fed to the decoder. The final hidden states of all the RNN layers of the encoder are also used to initialize the hidden states of all the RNN layers of the decoder. Hence, the encoder and decoder should have the same number of RNN layers (that is, $n$).
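
A matching decoder sketch, under the same assumptions as above; the points being illustrated are the concatenation of the embedded target tokens with the context $\mathbf{C}$ and the reuse of the encoder's hidden states:

```python
import torch
from torch import nn


class Seq2SeqDecoder(nn.Module):
    """Sketch of the decoder: the context C (top-layer final hidden state of
    the encoder) is concatenated to every embedded target token, and the
    encoder's final hidden states initialize the decoder's GRU."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers, dropout=dropout)
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs):
        return enc_outputs[1]            # the encoder's final hidden states

    def forward(self, X, state):
        emb = self.embedding(X).permute(1, 0, 2)       # (t, batch_size, embed_size)
        # C: repeat the encoder's top-layer state for every decoder timestep.
        context = state[-1].repeat(emb.shape[0], 1, 1)
        rnn_input = torch.cat((emb, context), dim=2)   # the complete input [X, C]
        output, state = self.rnn(rnn_input, state)
        return self.dense(output).permute(1, 0, 2), state   # (batch_size, t, vocab_size)
```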

For convenience, the length of all sequences is fixed to $t$. Sentences longer than $t$ are truncated to $t$; sentences shorter than $t$ are padded up to $t$. When computing the softmax cross-entropy loss, we should only consider the valid length and give the padded positions zero weight.
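
A sketch of such a masked loss, assuming PyTorch; `valid_len` holds each sentence's true length before padding:

```python
import torch
from torch import nn


def masked_softmax_ce(pred, label, valid_len):
    """Sketch of a masked cross-entropy loss: positions beyond each sentence's
    valid length get weight 0, so padding does not affect the loss.
    pred: (batch, t, vocab_size), label: (batch, t), valid_len: (batch,)."""
    t = label.shape[1]
    # mask[i, j] = 1 if position j is within sentence i's valid length, else 0.
    mask = (torch.arange(t, device=label.device)[None, :] < valid_len[:, None]).float()
    loss = nn.functional.cross_entropy(pred.permute(0, 2, 1), label, reduction='none')
    return (loss * mask).sum(dim=1) / valid_len      # mean loss over valid tokens
```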

Training and predicting

When training, what we feed to the decoder is the correct target sequence, as in Fig. 4. However, when predicting, we only feed the token that indicates the start of a sentence (<bos>) to the decoder, and each output token is fed back as the next decoder input until the decoder generates the token that indicates the end of a sentence (<eos>). A minimal prediction loop is sketched after Fig. 6.

Fig. 6. Making prediction (translation)
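
A minimal prediction loop under the same assumptions as the sketches above; `model` combines the encoder and decoder, and `tgt_vocab` (illustrative) maps tokens such as `<bos>` and `<eos>` to indices:

```python
import torch


def predict_seq2seq(model, src_tokens, tgt_vocab, max_len):
    """Sketch of prediction: start from <bos>, feed each output token back in
    as the next decoder input, stop at <eos> or after max_len steps."""
    model.eval()
    enc_outputs = model.encoder(src_tokens)
    state = model.decoder.init_state(enc_outputs)
    dec_X = torch.tensor([[tgt_vocab['<bos>']]])   # first decoder input: <bos>
    output_tokens = []
    for _ in range(max_len):
        Y, state = model.decoder(dec_X, state)
        dec_X = Y.argmax(dim=2)                    # most likely token at this step
        token = int(dec_X.squeeze())
        if token == tgt_vocab['<eos>']:
            break
        output_tokens.append(token)
    return output_tokens
```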

Assessment

Bilingual evaluation understudy (BLEU) is a method to evaluate the accuracy of machine translation and other sequence-to-sequence applications. BLEU always ranges between $0$ and $1$: the larger BLEU is, the more accurate the prediction. It is defined as:

$$
\exp(\min(0,1-\frac{\text{len} _{label}}{\text{len} _{pred}}))\prod\limits _{n=1}^k p _n ^{1/2^n}
$$

where $\text{len} _{label}$ is the number of tokens in the target sequence $y$ and $\text{len} _{pred}$ is the number of tokens in the predicted sequence $\hat{y}$. Since shorter predictions contain fewer tokens, which tends to make $\prod\limits _{n=1}^k p _n ^{1/2^n}$ bigger, the exponential factor penalizes short predictions.

The smaller $\text{len} _{pred}$ is, the larger $\text{len} _{label}/{\text{len} _{pred}}$ is, and thus the smaller $\exp(\min(...))$ is.
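
For instance (an illustrative example, not from the original), with $\text{len} _{label}=6$ and $\text{len} _{pred}=3$ the penalty factor is

$$
\exp\left(\min\left(0,\,1-\tfrac{6}{3}\right)\right) = e^{-1} \approx 0.37,
$$

while any prediction at least as long as the label gets a factor of $\exp(0)=1$.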

$p_n$ is the precision of $n$-grams, namely, the accuracy with which runs of $n$ consecutive tokens are predicted. Because longer $n$-grams are harder to predict, they are given higher weight than shorter ones, and the exponent $1/2^n$ does exactly this: since $p_n$ is usually smaller than $1$ and $1/2^n$ decreases as $n$ grows, $p_n^{1/2^n}$ is larger for bigger $n$ even when $p_n$ stays the same, so matching long $n$-grams is rewarded more.

Fig. 7. Curve of $p_n^{1/2^n}$ ($p_n=0.5$)
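
A hedged sketch of computing this score in Python (the function name `bleu` and the token-list inputs are illustrative; label $n$-gram counts are clipped so each one is matched at most once):

```python
import collections
import math


def bleu(pred_tokens, label_tokens, k):
    """Sketch of BLEU: brevity penalty times the product of n-gram
    precisions p_n, each raised to the power 1/2**n, for n = 1..k."""
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0.0, 1 - len_label / len_pred))
    for n in range(1, min(k, len_pred) + 1):
        num_matches = 0
        # Count the label's n-grams so each can be matched at most once.
        label_ngrams = collections.Counter(
            tuple(label_tokens[i:i + n]) for i in range(len_label - n + 1))
        for i in range(len_pred - n + 1):
            ngram = tuple(pred_tokens[i:i + n])
            if label_ngrams[ngram] > 0:
                num_matches += 1
                label_ngrams[ngram] -= 1
        score *= (num_matches / (len_pred - n + 1)) ** (0.5 ** n)
    return score
```

For example, with `pred = ['a', 'b', 'c']` and `label = ['a', 'b', 'c', 'd']`, every uni- and bigram of the prediction matches, so the score reduces to the brevity penalty $\exp(1-4/3)\approx 0.72$.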

Beam search

When we predict each token by taking the most probable entry of the softmax output, we are actually using greedy search. That is, at each timestep, the token we choose is the one with the highest probability.

Fig. 8. Greedy search

However, a local optimum does not necessarily lead to the global optimum. If we instead choose B at timestep 1 and feed it to the decoder at timestep 2, we might get a higher probability for the whole predicted sequence. Exhaustive search can solve this problem, but it is far too time-consuming.

Beam search is a compromise between greedy search and exhaustive search. At the first timestep, beam search chooses the $k$ tokens with the highest probabilities; at every later timestep it keeps only the $k$ most probable partial sequences, finally producing $k$ candidate sequences. The candidate sequence with the highest probability is then chosen. $k$ is a hyperparameter called the beam size. A hedged sketch of the procedure follows Fig. 9.

Fig. 9. Beam search
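
In the sketch below (illustrative, independent of any particular model), `step_log_probs` is an assumed callable that returns the log-probability of each possible next token given a prefix; in a Seq2seq model it would wrap the decoder. `bos` and `eos` are the start- and end-of-sentence token ids.

```python
def beam_search(step_log_probs, bos, eos, beam_size, max_len):
    """Sketch of beam search: keep the beam_size highest-scoring prefixes
    at every step, then return the finished sequence with the best score."""
    beams = [([bos], 0.0)]                      # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + [token], score + logp))
        # Keep only the beam_size best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))   # sequence is complete
            else:
                beams.append((prefix, score))      # keep expanding
        if not beams:
            break
    finished.extend(beams)
    # Choose the candidate sequence with the highest score.
    return max(finished, key=lambda c: c[1])[0]
```

With `beam_size = 1` this reduces to greedy search; exhaustive search corresponds to an unbounded beam.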