Transformer: Self-Attention and Parallelization

Model

The transformer handles sequences with an encoder-decoder architecture. Unlike seq2seq, which combines attention mechanisms with RNNs, the transformer is based purely on attention. Both its encoder and its decoder consist of $n$ transformer blocks.

Fig. 1. Architecture of transformer

The output of the encoder is $\mathbf{y}_1,...,\mathbf{y}_n$, where $\mathbf{y}_i$ is the encoding of the $i$-th token. Except for the last one, each encoder block passes its output to the next encoder block as input. The output of the last encoder block serves as the keys and values for the multi-head attention layer of every decoder block, whose queries come from the output of the masked multi-head attention layer. As a result, the outputs of the encoder and of the decoder must have the same shape.

Fig. 2. Information transfer between encoder and decoder

A PyTorch implementation of the transformer can be found in Dive into Deep Learning.

Layer Normalization

  • BatchNorm: normalizes each feature dimension to zero mean and unit variance (for two-dimensional input: per column).
  • LayerNorm: normalizes all features of each sample to zero mean and unit variance (for two-dimensional input: per row).

Sequence data, however, is generally three-dimensional, i.e. (batch_size, sequence_length, features), and different sequences may have different effective lengths. The mean and variance that BatchNorm computes across such sequences therefore fluctuate more, which disturbs the global (running) statistics; and when the sequences seen at prediction time are much longer or shorter than the training sequences, those global statistics may not be very useful. LayerNorm normalizes all features of each sample individually, so every sample depends only on itself, which makes it comparatively stable.

Fig. 3. BatchNorm and LayerNorm
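
As a quick illustration of the difference, the sketch below (assuming PyTorch's nn.LayerNorm and nn.BatchNorm1d and a made-up toy tensor) normalizes the same (batch_size, sequence_length, features) input both ways; note that nn.BatchNorm1d expects the channel dimension in second place.

```python
import torch
from torch import nn

# A toy (batch_size, sequence_length, features) tensor.
X = torch.randn(2, 4, 8)

ln = nn.LayerNorm(8)       # normalizes over the last (feature) dimension of each token
bn = nn.BatchNorm1d(8)     # normalizes each feature over the batch and time dimensions

Y_ln = ln(X)                                     # per-sample, per-token statistics
Y_bn = bn(X.permute(0, 2, 1)).permute(0, 2, 1)   # BatchNorm1d expects (N, C, L)

# Each token's features now have ~0 mean and ~1 variance under LayerNorm.
print(Y_ln.mean(dim=-1)[0, 0], Y_ln.var(dim=-1, unbiased=False)[0, 0])
```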

Encoder

The encoder is made up of one embedding layer and $n$ encoder blocks. Each encoder block consists of two sublayers: multi-head self-attention and a position-wise feed-forward network. The output of every layer has the same shape, so the hyperparameters that can be tuned are the number of encoder blocks and the output dimension.

The inputs of the encoder are the keys, values, and queries, which are all the source sequence. After the embedding layer, the feature dimension of the input is the same as that of the output. The first encoder block takes the source sequence after embedding and positional encoding as input, while every other block takes the output of the previous encoder block.

None of the layers in the encoder or decoder (apart from the embedding layer) change the shape of the samples.

Embedding & Positional Encoding

The embedding layer maps the feature dimension of every key, value, and query to the same dimension num_hiddens, which is the output dimension of the encoder; positional information is then added to all keys, values, and queries.

Since the keys, values, and queries are the same tensor at this point, only one variable needs to be processed; key_size, value_size, and query_size, the feature dimensions of the keys, values, and queries, are all equal.

Since the positional encodings lie between $-1$ and $1$, the samples produced by the embedding layer are multiplied by the square root of num_hiddens before the positional information is added, so that the features of each token are not much smaller in magnitude than the positional encodings.
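
A minimal sketch of this step, assuming a sinusoidal positional encoding and illustrative names (EmbeddingWithPosition, max_len) that are not from the source:

```python
import math
import torch
from torch import nn

class EmbeddingWithPosition(nn.Module):
    def __init__(self, vocab_size, num_hiddens, max_len=1000):
        super().__init__()
        self.num_hiddens = num_hiddens
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        # Precompute sinusoidal positional encodings, whose values lie in [-1, 1].
        pe = torch.zeros(1, max_len, num_hiddens)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.pow(10000, torch.arange(0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
        pe[0, :, 0::2] = torch.sin(pos / div)
        pe[0, :, 1::2] = torch.cos(pos / div)
        self.register_buffer("pe", pe)

    def forward(self, X):  # X: (batch_size, seq_len) token ids
        # Rescale embeddings by sqrt(num_hiddens) before adding position information.
        X = self.embedding(X) * math.sqrt(self.num_hiddens)
        return X + self.pe[:, :X.shape[1], :]
```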

Encoder Block

The encoder block consists of two modules: a multi-head self-attention layer and a position-wise FFN layer. Each module is followed by a residual connection and layer normalization (residual connection first, then layer normalization), which makes it possible to train deeper networks.

This also reflects that in the transformer the feature dimensions of the input and output of every module stay constant, which is very different from traditional CNNs.
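
A sketch of one encoder block under these conventions, using PyTorch's nn.MultiheadAttention as a stand-in for the multi-head attention layer described below (an illustration, not the d2l implementation):

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    def __init__(self, num_hiddens, num_heads, ffn_num_hiddens, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(num_hiddens, num_heads,
                                               dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(num_hiddens)
        self.ffn = nn.Sequential(nn.Linear(num_hiddens, ffn_num_hiddens), nn.ReLU(),
                                 nn.Linear(ffn_num_hiddens, num_hiddens))
        self.ln2 = nn.LayerNorm(num_hiddens)

    def forward(self, X):                      # X: (batch_size, seq_len, num_hiddens)
        attn_out, _ = self.attention(X, X, X)  # self-attention: Q = K = V = X
        Y = self.ln1(X + attn_out)             # residual connection, then LayerNorm
        return self.ln2(Y + self.ffn(Y))       # same pattern for the FFN sublayer
```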

Multi-Head Attention

The transformer uses dot-product attention, which by itself has no learnable parameters. Therefore, to give the multi-head attention layer parameters that can be learned, the keys, values, and queries are first mapped through a fully connected layer before each head.

Fig. 4. Multi-head attention and fully connected layer

Attention is just a weighted sum of the values, so dot-product attention does not change the shape of its input; the fully connected layers, however, may. Since the feature dimension after the multi-head attention layer should remain unchanged, the number of heads is tied to the output dimension of each head's fully connected layer:

$$
\text{heads}\times\text{num_hiddens_FC}=\text{num_hiddens}
$$

Therefore, the per-head fully connected layers can be merged into one large fully connected layer for each of the queries, keys, and values, with parameters W_q, W_k, and W_v respectively, each with output size num_hiddens. This gives better parallelism.

Fig. 5. Large fully connected layer

The output of the above operations has shape (batch_size, num_queries or num_key-value_pairs, num_hiddens). A reshaping operation changes this to (batch_size, num_queries or num_key-value_pairs, heads, num_hiddens/heads), which separates the heads again so that the output of each head can be computed independently. After the attention pooling, the outputs of the heads are concatenated and passed through one more linear transformation (W_o), which still does not change the feature dimension but adds learnable parameters.
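
Putting the pieces together, the sketch below (illustrative, not the d2l code) shows the merged projections W_q, W_k, W_v of output size num_hiddens, the reshaping that separates the heads, the parallel scaled dot-product attention, and the final projection W_o:

```python
import math
import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    def __init__(self, num_hiddens, num_heads):
        super().__init__()
        assert num_hiddens % num_heads == 0
        self.num_heads = num_heads
        self.W_q = nn.Linear(num_hiddens, num_hiddens, bias=False)  # all heads merged
        self.W_k = nn.Linear(num_hiddens, num_hiddens, bias=False)
        self.W_v = nn.Linear(num_hiddens, num_hiddens, bias=False)
        self.W_o = nn.Linear(num_hiddens, num_hiddens, bias=False)  # after concatenation

    def split_heads(self, X):
        # (batch, num_items, num_hiddens) -> (batch, heads, num_items, num_hiddens/heads)
        B, N, H = X.shape
        return X.reshape(B, N, self.num_heads, H // self.num_heads).permute(0, 2, 1, 3)

    def forward(self, queries, keys, values):
        Q = self.split_heads(self.W_q(queries))
        K = self.split_heads(self.W_k(keys))
        V = self.split_heads(self.W_v(values))
        d = Q.shape[-1]
        weights = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d), dim=-1)
        out = weights @ V                                    # (B, heads, N_q, H/heads)
        B, h, N, dh = out.shape
        out = out.permute(0, 2, 1, 3).reshape(B, N, h * dh)  # concatenate the heads
        return self.W_o(out)                                 # feature dim unchanged
```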

Position-wise Feed-Forward Networks

This module is simply an MLP with a single hidden layer, applied to each position independently; it too does not change the feature dimension of its input.
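
A minimal sketch, assuming a ReLU activation and an illustrative hidden size ffn_num_hiddens:

```python
from torch import nn

class PositionWiseFFN(nn.Module):
    def __init__(self, num_hiddens, ffn_num_hiddens):
        super().__init__()
        self.dense1 = nn.Linear(num_hiddens, ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.Linear(ffn_num_hiddens, num_hiddens)  # back to num_hiddens

    def forward(self, X):  # X: (batch_size, seq_len, num_hiddens), applied per position
        return self.dense2(self.relu(self.dense1(X)))
```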

Decoder

The decoder is made up of one embedding layer and $n$ decoder blocks. The input of the decoder is the target sequence. The first decoder block takes the target sequence after the embedding layer as input, while every other block takes the output of the previous decoder block.

When making predictions, the work of the encoder does not change: it processes the source sequence all at once. The decoder, on the other hand, has to work step by step, because the predictions are produced one at a time. That is, when predicting the $(t+1)$-th output, the keys and values of the decoder's self-attention are the predictions from step $0$ to step $t$, while the query is the $t$-th prediction.
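
A hedged sketch of this step-by-step prediction with greedy search, assuming a hypothetical net with encoder and decoder callables and <bos>/<eos> token ids (none of these names come from the source):

```python
import torch

def greedy_decode(net, src_tokens, bos_idx, eos_idx, max_len):
    enc_outputs = net.encoder(src_tokens)        # the encoder runs once on the source
    dec_inputs = torch.tensor([[bos_idx]])       # start from <bos>
    for _ in range(max_len):
        logits = net.decoder(dec_inputs, enc_outputs)  # sees all tokens decoded so far
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        dec_inputs = torch.cat([dec_inputs, next_token], dim=1)
        if next_token.item() == eos_idx:
            break
    return dec_inputs[:, 1:]                     # drop the leading <bos>
```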

Embedding & Positional Encoding

The functions and operations are the same as in the encoder, except that the input is the target sequence.

Decoder Block

The decoder block consists of three modules: masked multi-head attention, multi-head attention (over the encoder outputs), and a position-wise FFN. As in the encoder, each module is followed by a residual connection and layer normalization.

Masked Multi-Head Attention

This module extracts the feature information of the target sequence. It works almost the same way as the encoder's multi-head self-attention, except that it adds a masking operation. The purpose of the mask is to give the transformer the same autoregressive behavior during training as during prediction: at prediction time the decoder cannot see the whole target sequence at once, while at training time it can. Therefore, during training, a mask hides the tokens after the current time step so that they are invisible.

The mechanism is to set the attention scores of the tokens after the current time step to negative infinity before the $\text{softmax}$, so that the attention weights assigned to them are essentially $0$. This achieves the goal of shielding future tokens:

$$
\text{softmax}(\text{masked}(\mathbf{QK}^\text{T}))\mathbf{V}
$$
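
A sketch of this masking for a single head (the multi-head version masks the scores in exactly the same way); the helper name and mask construction are illustrative:

```python
import math
import torch

def masked_self_attention(Q, K, V):
    # Q, K, V: (batch_size, seq_len, d)
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # (batch_size, seq_len, seq_len)
    seq_len = scores.shape[-1]
    # True above the diagonal, i.e. wherever the key position lies in the future.
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))  # -inf -> weight ~ 0 after softmax
    return torch.softmax(scores, dim=-1) @ V
```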

Multi-Head Attention

The same as in the encoder, except that the keys and values come from the final output of the encoder.

Position-wise Feed-Forward Networks

The same as in the encoder.

Dense Layer

It is a fully connected layer that implements softmax regression over the vocabulary.

By default, PyTorch treats the last dimension as the feature dimension.
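
For illustration, assuming an nn.Linear output layer and made-up sizes:

```python
import torch
from torch import nn

num_hiddens, vocab_size = 64, 10000         # illustrative sizes
dense = nn.Linear(num_hiddens, vocab_size)  # applied over the last (feature) dimension

dec_output = torch.randn(2, 5, num_hiddens)  # (batch_size, seq_len, num_hiddens)
logits = dense(dec_output)                   # (batch_size, seq_len, vocab_size)
probs = torch.softmax(logits, dim=-1)        # softmax regression over the vocabulary
```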