CNN: Feature Extraction
From Dense to Convolution
The convolutional layer is the key layer of a CNN. It is a special kind of dense layer, and CNNs are used to deal with images. In this section, we take the grayscale image as an example, that is, the feature of an input image is a 2-D matrix. When processing images, it is first reasonable that an agent's awareness of a certain object should not be overly concerned with the precise location of the object in the image, because what the object looks like has nothing to do with its position. Besides, the features of an object should only be related to the pixels around it, as each pixel only determines the depth of the color and we distinguish objects by their color boundaries. These two principles are called translation invariance and locality, respectively. If a dense layer fulfills both principles, it becomes a convolutional layer.
Constructing a dense layer
For convenience, we keep the input image $X$ 2-D and make the model parameters of one neuron, $w$, have the same dimensions as $X$. Under such a setting, the output of one neuron becomes:
```python
(X * w).sum()  # w is a matrix with the same shape as X
```
Now that the output of a hidden layer $H$ should be an image with the same shape as $X$, we have to place $m\times n$ neurons in the hidden layer, where $m$ and $n$ are the dimensions of $X$. Hence, for a hidden layer, its model parameter $W$ is:
$$
{\begin{bmatrix}
{\begin{bmatrix}
w_{0,0,0,0} & w_{0,0,0,1}\\
w_{0,0,1,0} & w_{0,0,1,1}
\end{bmatrix}
\begin{bmatrix}
w_{0,1,0,0} & w_{0,1,0,1}\\
w_{0,1,1,0} & w_{0,1,1,1}
\end{bmatrix}}\\
{\begin{bmatrix}
w_{1,0,0,0} & w_{1,0,0,1}\\
w_{1,0,1,0} & w_{1,0,1,1}
\end{bmatrix}
\begin{bmatrix}
w_{1,1,0,0} & w_{1,1,0,1}\\
w_{1,1,1,0} & w_{1,1,1,1}
\end{bmatrix}}
\end{bmatrix}} _{m\times n\times m\times n=2\times 2\times 2\times 2}
$$
And an element of output $H$ is:
$$
[H]_{i,j}=\sum_k \sum_l {[W]} _{i,j,k,l} {[X]} _{k,l} + {[U]} _{i,j}
$$
where $U$ contains the bias of each output pixel.
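As a sanity check on this formula, here is a minimal sketch in PyTorch; the sizes and random tensors are purely illustrative:

```python
import torch

# Fully-connected formulation: a 4-D weight and one bias per output pixel.
m, n = 2, 2
X = torch.randn(m, n)        # input image
W = torch.randn(m, n, m, n)  # one (m x n) weight matrix per output pixel
U = torch.randn(m, n)        # one bias per output pixel

# [H]_{i,j} = sum_{k,l} [W]_{i,j,k,l} [X]_{k,l} + [U]_{i,j}
H = torch.einsum('ijkl,kl->ij', W, X) + U
print(H.shape)  # torch.Size([2, 2])
```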
Translation invariance
Now we can clearly see that in order to process an image with $m \times n$ pixels, we use roughly $(m \times n)^2$ model parameters. This is extremely space-consuming. However, if we apply the translation invariance principle to the dense layer, we can cut it down to roughly $m \times n$.
As mentioned above, the activation of a pixel should have nothing to do with its position. Re-indexing $[W]_{i,j,k,l}$ as $[V]_{i,j,a,b}$ with $k=i+a$ and $l=j+b$, translation invariance is only possible when $[V]_{i,j,a,b}$ and $[U]_{i,j}$ do not depend on $(i,j)$. Hence, for a certain hidden layer, we actually just need a 2-D $V$ and a scalar $u$:
$$
{[H]} _{i,j}=\sum_a\sum_b{[V]} _{a,b} {[X]} _{i+a,j+b}+u
$$
Locality
So far, we have significantly cut the space of model parameters, but it is still too large. Now it is time for locality. As motivated above, there is no need to look far away from the location of the input pixel ${[X]}_ {i,j}$; we just need to consider a small window around it:
$${[H]}_ {i, j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} {[V]}_ {a, b} {[X]}_ {i+a, j+b}.$$
where the window has size $(2\Delta+1)\times(2\Delta+1)$ and $V$ (of the same size) is the model parameter inside the window: a filter that generates the feature of a zone in an image, also called a convolution kernel.
Now we construct a convolutional layer (`nn.Conv2d`) with hyperparameters kernel size and stride (see below).
$$Y = X \star W+b$$
where $W$ and $b$ are model parameters that the agent can learn. $\star$ is the cross-correlation operator, which is slightly different from convolution, but their behaviors are equivalent in the convolutional layer (see below). The dimensions of $Y$ are smaller than the original dimensions, namely:
$$(n_h-k_h+1)\times(n_w-k_w+1)$$
where $n_h$ and $n_w$ are the dimensions of $X$, and $k_h$ and $k_w$ are the dimensions of $W$. This is not a problem and we can solve it easily with padding (see below).
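To make the cross-correlation and the output shape concrete, here is a minimal single-channel sketch; the helper name `corr2d` and the example tensors are my own choices:

```python
import torch

def corr2d(X, K):
    """2-D cross-correlation of a single-channel input X with kernel K."""
    n_h, n_w = X.shape
    k_h, k_w = K.shape
    Y = torch.zeros(n_h - k_h + 1, n_w - k_w + 1)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Slide the window and take the weighted sum inside it
            Y[i, j] = (X[i:i + k_h, j:j + k_w] * K).sum()
    return Y

X = torch.arange(9, dtype=torch.float32).reshape(3, 3)
K = torch.ones(2, 2)
print(corr2d(X, K).shape)  # torch.Size([2, 2]), i.e. (3 - 2 + 1) x (3 - 2 + 1)
```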
Convolution, Fourier and Neural Networks
Convolution is an operation on two functions ($f$ and $g$) that produces a third function ($f*g$) expressing how the shape of one is modified by the other:
$$(f*g)(t)=\int_{-\infty} ^{+\infty}f(\tau)g(t-\tau)d\tau$$
Physically, we can explain it from two perspectives.
Stock
Suppose there is a system with input $f(t)$ at each moment. The input starts decaying the moment it enters the system, following $g(t)$, which gives the remaining percentage after time $t$.
For items that enter the system between $t$ and $t+\Delta t$, the amount remaining at time $x$ is approximately:
$$f(t)g(x-t)\Delta t$$
Therefore, for items that enter the system between $t_1$ and $t_2$, the amount remaining at time $x$ is:
$$\int_{t_1}^{t_2}f(t)g(x-t)dt$$
In this case, we take $f(t)$ as the time-varying input and $g(t)$ as the fixed decay law, and the convolution computes the stock of the system after some time.
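A minimal discrete sketch of this "stock" view (the numbers are made up):

```python
# f[t]: amount entering the system at step t; g[k]: fraction remaining k steps later.
f = [10.0, 0.0, 5.0, 0.0]
g = [1.0, 0.5, 0.25, 0.125]

# Stock at step x: sum_t f[t] * g[x - t], a discrete version of the integral above.
stock = [sum(f[t] * g[x - t] for t in range(x + 1)) for x in range(len(f))]
print(stock)  # [10.0, 5.0, 7.5, 3.75]
```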
Influence
In the explanation above, we actually associate $dt$ with $f(t)$. Namely, $f(t)\,dt$ is the input of the system between $t$ and $t+\Delta t$, while $g(x-t)$ has nothing to do with $dt$; it is just a decaying factor.
However, we can also take $g(x-t)\,dt$ as a whole. This makes $t$ and $f * g$ more meaningful. For example, we can regard $f(t)$ as something that happens at $t$ and $g(t)$ as its influence over time. Then $f*g$ can be regarded as the total influence at $x$ caused by things that happen between $t_1$ and $t_2$:
$$\int_{t_1}^{t_2}f(t)g(x-t)dt$$
To go a step further, we can take $t$ as a position, $f(t)$ as something that happens at $t$ and keeps moving from $t$ to $x$, and $g(t)$ as its reward or loss per unit distance. For example, people with total weight $f(t)$ start running at position $t$ and lose $g(d)$ kg per meter after covering distance $d$. Then $f*g$ is the total weight lost by everyone by position $x$ (the heavier a person is, the more he/she contributes).
Convolution in CNN
For convolution in CNNs, we expand it to 2-D and take $t$ to be a discrete index representing a position $(x,y)$. Then we can take $X$ as the function $f$ and $W$ as $g$. Now the convolution in a CNN can be interpreted as the influence that the pixels around $f(x,y)$ exert on the pixel $f(x,y)$, or as a certain average feature of the pixels around $f(x,y)$.
$$f * g=\sum _{i,j}f(i,j)g(x-i,y-j)$$
The phrase certain average feature may be a little confusing. However, if we regard the convolution kernel $W$ as a feature filter that extracts specific features of a region of an image, it makes sense. Different filters extract different features, and their parameters are entirely learnt by the agent itself.
There is still a difference between convolution and the convolutional layer: $g$ is not the kernel we actually use. For example, convolution computes $f(x-1,y-1)g(1,1)$ rather than $f(x-1,y-1)g(-1,-1)$, whereas in a CNN we actually compute $f(x-1,y-1)g(-1,-1)$. In fact, it doesn't matter: we just need to flip $g$ (rotate it by 180°) and we get the $W$ we use in the CNN. Hence, what the agent computes is cross-correlation rather than convolution, but the learnt results are equivalent, so we just call it convolution.
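A small sketch of this point, reusing the `corr2d` helper sketched earlier: true convolution is just cross-correlation with a flipped kernel, so a kernel learnt one way works equally well the other way.

```python
import torch

X = torch.randn(5, 5)
K = torch.randn(3, 3)

cross_corr = corr2d(X, K)                      # what the convolutional layer computes
true_conv  = corr2d(X, torch.flip(K, [0, 1]))  # convolution = flip the kernel, then cross-correlate
# Since the kernel is learnt, both formulations can reach the same result.
```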
Hyperparameters of convolutional layer
Kernel size, padding, stride, output (input) channels are the new hyperparameters that we can adjust in the convolutional layer.
- Kernel size determines the size of window. It is the simplest parameter among them. We usually set it to $3\times3$ or $5\times5$ for 2-D input.
- Padding is used to solve the dimensionality reduction problem of the output image.
- Stride determines the movement of window. The stride we use above is $1\times1$.
- The number of input channels is not a hyperparameter of the convolutional layer but it determines the size of input.
- The number of output channels determines the number of images the convolutional layer outputs.
Kernel size
It is recommended to adopt a small kernel size with a deep neural network instead of a large kernel size with a shallow one. Their results are almost equivalent, but the former is faster, because the computation grows with the number of parameters (for a kernel, roughly quadratically in its side length). That's why we usually set the size to 3 or 5 instead of 11 or 13.
The size here refers to the length of every dimension.
Such a structure is very reasonable, as each kernel gathers the features of neighbouring points together, layer by layer. Therefore, in the final layer, the data that a kernel sees is a combination of all pixels in the image. Namely, its effective kernel size (receptive field) is actually the whole image.
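As an illustration of why small kernels in a deep network can match a large kernel in a shallow one, here is a minimal sketch (the layer sizes are arbitrary): two stacked 3×3 layers cover the same 5×5 receptive field as one 5×5 layer, with 18 spatial weights instead of 25.

```python
import torch
from torch import nn

# Two stacked 3x3 convolutions see the same 5x5 region as a single 5x5 convolution.
stacked = nn.Sequential(nn.Conv2d(1, 1, kernel_size=3), nn.Conv2d(1, 1, kernel_size=3))
single = nn.Conv2d(1, 1, kernel_size=5)

X = torch.randn(1, 1, 5, 5)               # (batch, channels, height, width)
print(stacked(X).shape, single(X).shape)  # both torch.Size([1, 1, 1, 1])
```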
Padding
Padding adds extra rows and columns (usually zeros) around the input image.
After padding, the shape of output becomes:
$$(n_h-k_h+1+p_h)\times(n_w-k_w+1+p_w)$$
Generally, we let:
$$
p_h=k_h-1,\space p_w=k_w-1
$$
so that the shape of the output is the same as the shape of the input. For convenience, we often take $k_h$ and $k_w$ to be odd numbers, so that we can pad $p_h/2$ rows ($p_w/2$ columns) on both sides. If $k_h$ and $k_w$ are even, we pad $\lceil p_h/2\rceil$ rows on the upper side and $\lfloor p_h/2\rfloor$ rows on the lower side, or vice versa.
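A minimal sketch of shape-preserving padding with `nn.Conv2d`; note that its `padding` argument counts the rows/columns added on *each* side, so `padding=1` with `kernel_size=3` gives $p_h=p_w=k-1=2$ in total:

```python
import torch
from torch import nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
X = torch.randn(1, 1, 8, 8)  # (batch, channels, height, width)
print(conv(X).shape)         # torch.Size([1, 1, 8, 8]) -- same spatial shape as the input
```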
Stride
Stride refers to the step size of the kernel window along the rows and columns. When the stride is too small, the agent needs a large amount of computation before the output becomes small.
For a certain stride $s_h\times s_w$, the shape of output becomes:
$$\lfloor(n_h-k_h+p_h)/s_h+1\rfloor\times\lfloor(n_w-k_w+p_w)/s_w+1\rfloor
$$
If:
$$
p_h=k_h-1,\space p_w=k_w-1
$$
The shape becomes:
$$\lfloor(n_h-1)/s_h+1\rfloor\times\lfloor(n_w-1)/s_w+1\rfloor
$$
We can roughly say that the stride divides the height and width of the output by $s_h$ and $s_w$, respectively.
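A quick sketch verifying the shape formula (the sizes are arbitrary): with kernel 3, padding 1 per side, and stride 2, an $8\times 8$ input gives $\lfloor(8-3+2)/2+1\rfloor=4$ in each spatial dimension.

```python
import torch
from torch import nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
X = torch.randn(1, 1, 8, 8)
print(conv(X).shape)  # torch.Size([1, 1, 4, 4])
```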
Multiple input channels
The role of multiple input channels is to combine different features together.
- $X$: $c_i\times n_h\times n_w$
- $W$: $c_i\times k_h\times k_w$
- $Y$: $m_h\times m_w$
For each input channel, there is a kernel. The output over multiple input channels is the sum of the per-channel outputs, each channel being filtered by its own kernel. Namely, even though there are several input channels, they just generate one output. To a certain extent, each 2-D kernel can be regarded as one of the generalized $w$ of a neuron, and the whole multi-channel kernel constitutes the model parameters of one neuron.
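A minimal sketch of this, reusing the `corr2d` helper from above (the helper name `corr2d_multi_in` and the sizes are my own):

```python
import torch

def corr2d_multi_in(X, K):
    """X: (c_i, n_h, n_w), K: (c_i, k_h, k_w); one kernel per channel, results summed."""
    return sum(corr2d(x, k) for x, k in zip(X, K))

X = torch.randn(3, 5, 5)  # e.g. an RGB image
K = torch.randn(3, 2, 2)  # one 2x2 kernel per input channel
print(corr2d_multi_in(X, K).shape)  # torch.Size([4, 4]) -- a single output channel
```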
Multiple output channels
An output channel generates an output. It contains an independent set of kernels, one per input channel, which together form a 3-D kernel. In other words, the input channels are parts of an output channel. That's why the number of input channels is not a hyperparameter: it depends entirely on the data. For example, the number of input channels should be 3 for RGB images and 1 for grayscale images.
- $X$: $c_i\times n_h\times n_w$
- $W$: $c_o\times c_i\times k_h\times k_w$
- $Y$: $c_o\times m_h\times m_w$
- $B$ (bias): $c_o\times c_i$, one bias per kernel.
The computational complexity is roughly $O(c_ik_hk_wc_om_hm_w)$, where $c_ik_hk_w$ is the complexity of computing an output pixel and $c_om_hm_w$ is the number of output pixels.
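Building on the multi-input-channel sketch above, a minimal sketch of multiple output channels (the helper name and sizes are my own): stacking one $(c_i, k_h, k_w)$ kernel group per output channel gives the 4-D weight of shape $(c_o, c_i, k_h, k_w)$.

```python
import torch

def corr2d_multi_in_out(X, K):
    """X: (c_i, n_h, n_w), K: (c_o, c_i, k_h, k_w); one kernel group per output channel."""
    return torch.stack([corr2d_multi_in(X, k) for k in K])

X = torch.randn(3, 5, 5)
K = torch.randn(4, 3, 2, 2)
print(corr2d_multi_in_out(X, K).shape)  # torch.Size([4, 4, 4]) = c_o x m_h x m_w
```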
$1\times 1$ kernel
The $1\times 1$ kernel is a special kernel that doesn't recognize spatial patterns but just fuses channels. It can be regarded as a dense layer with input of shape $n_hn_w\times c_i$ ($c_i$ is the number of features) and weight of shape $c_i\times c_o$.
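A quick sketch of this equivalence (the sizes are arbitrary): a $1\times 1$ convolution gives the same result as a matrix multiply over the channel dimension at every pixel.

```python
import torch
import torch.nn.functional as F

c_i, c_o, h, w = 3, 2, 4, 4
X = torch.randn(1, c_i, h, w)
K = torch.randn(c_o, c_i, 1, 1)

Y1 = F.conv2d(X, K)  # as a 1x1 convolution
Y2 = (K.reshape(c_o, c_i) @ X.reshape(c_i, h * w)).reshape(1, c_o, h, w)  # as a matrix multiply
print(torch.allclose(Y1, Y2, atol=1e-5))  # True
```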
Pooling layer
The pooling layer is another type of layer in CNN. It has two functions:
- Reduce the sensitivity of convolutional layers to position;
- Reduce the computational complexity.
Sensitivity to position
The convolutional layer is highly sensitive to the position of pixels: a one-pixel offset can cause a large change in the output image. However, we may not want such a significant change, so we use the pooling layer to reduce this effect.
The hyperparameters of the pooling layer are almost the same as those of the convolutional layer. But the pooling layer is just an operator, that is, it doesn't have any model parameters to learn. Besides, it doesn't fuse channels. Therefore, the number of input channels is equal to the number of output channels in the pooling layer.
There are generally two types of pooling layer:
- Maximum pooling (`nn.MaxPool2d`) computes the maximum value of all elements in the pooling window;
- Average pooling (`nn.AvgPool2d`) computes the average value of all elements in the pooling window.
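A minimal sketch of both points (the sizes are arbitrary): pooling has no learnable parameters, keeps the channel count, and reduces the spatial dimensions. Note that `nn.MaxPool2d` defaults to a stride equal to the kernel size, unlike `nn.Conv2d`.

```python
import torch
from torch import nn

pool = nn.MaxPool2d(kernel_size=2)
X = torch.randn(1, 3, 8, 8)  # (batch, channels, height, width)
print(pool(X).shape)         # torch.Size([1, 3, 4, 4]) -- channels unchanged
```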
Computational complexity
Since pooling windows slide over the input in the same way as kernels in the convolutional layer, the pooling layer can reduce the dimensions of the data.
However, because data augmentation already increases data diversity and convolutional layers themselves can reduce dimensionality (e.g. via stride), the pooling layer is no longer so important.