Common CNN Models

LeNet-5

LeNet-5, which was published in 1998, is one of the simplest CNN models. It was originally used to recognize handwritten digits from 0 to 9.

Fig. 1. LeNet-5

Its structure is quite simple from today's perspective: start with two pairs of convolutional and pooling layers and end with three dense layers (the last followed by softmax). However, it pioneered the template that CNNs still follow:

  1. Start with convolutional and pooling layers to extract features and reduce data dimensionality;
  2. End with dense layers to produce the final prediction.

# structure of LeNet-5
import torch
from torch import nn

class Reshape(nn.Module):
    def forward(self, x):
        return x.view((-1, 1, 28, 28))  # reshape the input to (batch, 1, 28, 28)

net = nn.Sequential(
    Reshape(),
    nn.Conv2d(1, 6, kernel_size=5, padding=2),  # padding brings 28*28 up to an effective 32*32
    nn.Sigmoid(),
    nn.AvgPool2d(2),  # pooling windows in PyTorch do not overlap by default
    nn.Conv2d(6, 16, kernel_size=5),
    nn.Sigmoid(),
    nn.AvgPool2d(2),
    nn.Flatten(),
    nn.Linear(400, 120),
    nn.Sigmoid(),
    nn.Linear(120, 84),
    nn.Sigmoid(),
    nn.Linear(84, 10))

AlexNet

AlexNet started the craze for neural networks in 2012. Before it, kernel methods and SVMs dominated machine learning: features were extracted manually, the kernel function computed their correlations and turned the task into a convex optimization problem, and an SVM trained on those features worked well. That approach has elegant theory behind it and can be explained mathematically.

AlexNet is essentially a larger and deeper LeNet. Its main improvements are:

  • Dropout: controls overfitting in the deeper network;
  • ReLU: keeps gradients from vanishing, unlike sigmoid;
  • MaxPooling: produces larger outputs (and gradients) than average pooling, making the model easier to train;
  • ImageNet: a much larger dataset than MNIST;
  • Data augmentation: creates more samples and reduces the sensitivity of the convolutional layers to position.

AlexNet changed the way people think about machine learning:

  • Learning features with a CNN instead of extracting them manually;
  • Training the classifier (previously a separate, traditional ML method) together with the feature extractor (the CNN).

Fig. 2. AlexNet

# structure of AlexNet
# size of each image: 224*224
net = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),  # the dataset is MNIST; for ImageNet, 1 should be 3
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(6400, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 10)  # 10 output classes for MNIST
)

VGG

VGG (Visual Geometry Group) encapsulates the convolutional and pooling layers of AlexNet into reusable blocks, which makes the network more standardized.

Fig. 3. VGG

Usually, after each block, the height and width of the feature maps are halved and the number of channels is doubled.

# structure of VGG
# size of each image: 224*224

def vgg_block(num_convs, in_channels, out_channels):
    """
    Definition of a VGG block.

    Parameters:
        num_convs: the number of convolutional layers in this block
        in_channels: the number of input channels of the first convolutional layer
        out_channels: the number of output channels of the block

    Return:
        nn.Sequential
    """
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(
            in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

def vgg(conv_arch):
    """
    Definition of VGG.

    Parameters:
        conv_arch: structure of all the blocks, i.e. the number of convolutional
            layers and the number of output channels of each block

    Return:
        the whole VGG network
    """
    conv_blks = []
    in_channels = 1
    for (num_convs, out_channels) in conv_arch:
        conv_blks.append(vgg_block(
            num_convs, in_channels, out_channels))
        in_channels = out_channels

    return nn.Sequential(
        *conv_blks, nn.Flatten(),
        nn.Linear(out_channels * 7 * 7, 4096), nn.ReLU(), nn.Dropout(p=0.5),  # 224 / 2^5 = 7
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(4096, 10)
    )

NiN

NiN (Network in Network) follows the block idea of VGG. However, it removes the dense layers completely because they require too much memory. As an alternative, NiN uses $1\times1$ kernels with ReLU to add nonlinearity at each pixel, and it replaces the last dense layer with a global average pooling layer, which computes the mean of each input channel.

The global average pooling layer plays the same role as the last dense layer but requires far less computation. Softmax is handled by the CrossEntropy loss function.
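
As a quick sanity check (not part of the original code), global average pooling is simply the per-channel spatial mean, so nn.AdaptiveAvgPool2d((1, 1)) gives the same result as averaging over the height and width dimensions:

import torch
from torch import nn

x = torch.rand(2, 10, 5, 5)  # (batch, channels, height, width)
gap = nn.AdaptiveAvgPool2d((1, 1))
print(torch.allclose(gap(x), x.mean(dim=(2, 3), keepdim=True)))  # True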

Fig. 4. NiN

# structure of NiN

def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU())

net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, strides=4, padding=0),
    nn.MaxPool2d(3, stride=2),
    nin_block(96, 256, kernel_size=5, strides=1, padding=2),
    nn.MaxPool2d(3, stride=2),
    nin_block(256, 384, kernel_size=3, strides=1, padding=1),
    nn.MaxPool2d(3, stride=2),
    # regularization only; it can be removed
    nn.Dropout(0.5),
    # the number of labels is 10
    nin_block(384, 10, kernel_size=3, strides=1, padding=1),
    # the output shape of each channel is 1x1; the number of output channels
    # of the last block must equal the number of labels
    nn.AdaptiveAvgPool2d((1, 1)),
    # flatten the 10 channels into a vector of 10 scores
    nn.Flatten())

Global average pooling reduces the number of parameters and the computational cost, but it also slows convergence.

GoogLeNet

GoogLeNet inherits the block structure, the 1x1 convolutional layer, and the global average pooling layer from NiN. Its most important building block is the Inception block.

Fig. 5. Inception block (V1)

The Inception block consists of multiple parallel pathways that extract different types of features from the input. Each pathway generates new output channels using all the channels of the input data. Different pathways produce different numbers of output channels but keep the spatial shape of each channel the same as the input. Usually, the number of output channels is larger than the number of input channels. The output channels of the different pathways are concatenated to form the input of the next block.

In the Inception block, the 1x1 kernels are used to merge channels so that the computational complexity (which roughly tracks the number of parameters the network has to learn) is reduced, while the blue layers in Fig. 5 extract features from the input.
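
As an illustration (numbers taken from pathway 3 of the third GoogLeNet block below, biases ignored), mapping 192 input channels to 32 output channels with a direct 5x5 convolution costs roughly ten times as many parameters as first reducing to 16 channels with a 1x1 convolution:

# illustrative parameter count for the 5x5 pathway (192 -> 32 channels), biases ignored
direct = 192 * 32 * 5 * 5                       # 153600 weights
with_1x1 = 192 * 16 * 1 * 1 + 16 * 32 * 5 * 5   # 15872 weights, roughly 10x fewer
print(direct, with_1x1)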

The number of output channels of each pathway is a hyperparameter of the Inception block.

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

# Inception block
class Inception(nn.Module):
    # c1--c4 are the numbers of output channels of each pathway (c2 and c3 are pairs)
    def __init__(self, in_channels, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # pathway 1: single 1x1 convolutional layer
        self.p1_1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        # pathway 2: 1x1 convolutional layer + 3x3 convolutional layer
        self.p2_1 = nn.Conv2d(in_channels, c2[0], kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        # pathway 3: 1x1 convolutional layer + 5x5 convolutional layer
        self.p3_1 = nn.Conv2d(in_channels, c3[0], kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        # pathway 4: 3x3 max pooling layer + 1x1 convolutional layer
        self.p4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.p4_2 = nn.Conv2d(in_channels, c4, kernel_size=1)

    def forward(self, x):
        # ReLU adds nonlinearity to each pathway
        p1 = F.relu(self.p1_1(x))
        p2 = F.relu(self.p2_2(F.relu(self.p2_1(x))))
        p3 = F.relu(self.p3_2(F.relu(self.p3_1(x))))
        p4 = F.relu(self.p4_2(self.p4_1(x)))
        # concatenate the outputs along the channel dimension (dim 0 is the batch size)
        return torch.cat((p1, p2, p3, p4), dim=1)

The entire GoogLeNet uses 9 Inception blocks and one global average pooling layer to generate the prediction. Unlike NiN, the global average pooling layer in GoogLeNet does not need its number of output channels to equal the number of labels, because a dense layer follows it.

Fig. 6. GoogLeNet

The whole GoogLeNet can be divided into 5 stages. After each stage, the height and width of the feature maps are halved and the number of channels increases (often doubling). This is an idea followed by many subsequent neural networks. For instance, in the network built below, the output shape of each stage is:

Sequential output shape: torch.Size([1, 64, 24, 24])  # the first stage reduces height and width to a quarter (96 -> 24)
Sequential output shape: torch.Size([1, 192, 12, 12])
Sequential output shape: torch.Size([1, 480, 6, 6])
Sequential output shape: torch.Size([1, 832, 3, 3])
Sequential output shape: torch.Size([1, 1024])
Linear output shape: torch.Size([1, 10])

# structure of GoogLeNet
b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                   nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b2 = nn.Sequential(nn.Conv2d(64, 64, kernel_size=1),
                   nn.ReLU(),
                   nn.Conv2d(64, 192, kernel_size=3, padding=1),
                   nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b3 = nn.Sequential(Inception(192, 64, (96, 128), (16, 32), 32),
                   Inception(256, 128, (128, 192), (32, 96), 64),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b4 = nn.Sequential(Inception(480, 192, (96, 208), (16, 48), 64),
                   Inception(512, 160, (112, 224), (24, 64), 64),
                   Inception(512, 128, (128, 256), (24, 64), 64),
                   Inception(512, 112, (144, 288), (32, 64), 64),
                   Inception(528, 256, (160, 320), (32, 128), 128),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b5 = nn.Sequential(Inception(832, 256, (160, 320), (32, 128), 128),
                   Inception(832, 384, (192, 384), (48, 128), 128),
                   nn.AdaptiveAvgPool2d((1, 1)),
                   nn.Flatten())

net = nn.Sequential(b1, b2, b3, b4, b5, nn.Linear(1024, 10))
X = torch.rand(size=(1, 1, 96, 96))
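
The per-stage shapes listed above can be reproduced by pushing the dummy input X through each block in turn; a small sketch (this loop is not part of the original code):

# print the output shape after each stage (net and X are defined above)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)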

The above is the first version of GoogLeNet. In subsequent versions, Google made different improvements to the Inception block:

  • Inception-BN (V2): batch normalization;

  • Inception-V3: modified kernel sizes;

  • Inception-V4: residual connections.

Among them, V3 is the most commonly used.

Advantages of GoogLeNet: fewer model parameters and relatively low computational complexity.

Batch normalization

Batch normalization is a layer that makes the neural network converge faster by keeping the inputs of the hidden layers more stable. It works as follows:

$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}$$

where $\hat{\boldsymbol{\mu}}_\mathcal{B}$ is the mean of the small batch $\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ is its standard deviation. Normalizing alone would artificially force the distribution of $\mathbf{x}$ into a standard normal distribution, so we reserve two learnable parameters that let the network adjust the distribution of $\mathbf{x}$ by itself: scale ($\boldsymbol{\gamma}$) and shift ($\boldsymbol{\beta}$), which adjust the variance and the mean respectively. They have the same shape as $\mathbf{x}$.

For $\hat{\boldsymbol{\mu}}_\mathcal{B}$, it is easy to compute for each batch:

$$\hat{\boldsymbol{\mu}}_\mathcal{B} = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x}$$

But for $\hat{\boldsymbol{\sigma}}_\mathcal{B}$, we add a small $\epsilon$ so that we never divide by zero:

$$\hat{\boldsymbol{\sigma}}_\mathcal{B}^2 = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B})^2 + \epsilon$$

Usually, we set $\epsilon$ to 1e-5 or 1e-6.

Similar to dropout, the batch normalization layer behaves differently during training and prediction. When making predictions, we normalize the inputs of the hidden layers with a global mean and a global variance, which are maintained with an exponential moving average, much like the way RTT is estimated in computer networking:

moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
moving_var = momentum * moving_var + (1.0 - momentum) * var

where mean and var are the mean and variance computed on the current small batch, and momentum is a weighting factor (between 0 and 1).

$\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are still needed.

It is important to understand that all the means and variances are computed per feature or per channel (for a convolutional layer, a channel represents a feature). That is why the size of the second dimension of $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ equals the number of features for dense layers, or the number of input channels for convolutional layers. The sketch below shows this computation.
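
A minimal sketch of the training-time computation (batch_norm_train is a hypothetical helper name; the moving statistics used at prediction time are not updated here):

import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # gamma and beta must broadcast against x, e.g. shape (features,) for dense
    # inputs and (1, C, 1, 1) for convolutional inputs
    if x.dim() == 2:
        # dense layer: one mean/variance per feature, computed over the batch
        mean = x.mean(dim=0)
        var = ((x - mean) ** 2).mean(dim=0)
    else:
        # convolutional layer: one mean/variance per channel,
        # computed over the batch, height and width
        mean = x.mean(dim=(0, 2, 3), keepdim=True)
        var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta  # scale and shift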

We put the batch normalization layer before the activation layer. Its effect is a bit like dropout, so the two are usually not used together.
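
In practice we can use PyTorch's built-in layers. As a sketch, the LeNet-5 network from above with batch normalization inserted before each activation (nn.BatchNorm2d after convolutional layers, nn.BatchNorm1d after dense layers) looks like this:

from torch import nn

net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.BatchNorm2d(6), nn.Sigmoid(),
    nn.AvgPool2d(2),
    nn.Conv2d(6, 16, kernel_size=5), nn.BatchNorm2d(16), nn.Sigmoid(),
    nn.AvgPool2d(2),
    nn.Flatten(),
    nn.Linear(400, 120), nn.BatchNorm1d(120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.BatchNorm1d(84), nn.Sigmoid(),
    nn.Linear(84, 10))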

It is not entirely clear why batch normalization works. Some researchers conjecture that it acts as a source of noise within each mini-batch, which keeps the outputs of the different layers more stable. As a result, the gradients of the bottom layers are larger and the model converges faster, but the final accuracy of the model does not really change.

Xavier initialization makes the initial values of the model parameters more stable; batch normalization makes the outputs of the hidden layers (equivalently, the inputs of the following layers) stable. It is essentially a linear layer that reshapes the data the way we want, that is, a form of human guidance given to the model.

ResNet

ResNet, combined with batch normalization, enables us to train much deeper neural networks. Before ResNet, the models we trained corresponded to non-nested function classes, as in the left part of Fig. 7: when we add a new block to make the model more complex, we establish a new mapping $y=g(f(x))$, but we cannot guarantee that $g(f(x))$ is better than $f(x)$, since the new function class does not necessarily cover everything that $f(x)$ covers. What's more, in a deep network the gradients of the bottom layers are usually very small, so those layers converge slowly.

The core idea of ResNet is to compute $y=g(f(x))+f(x)$ instead of $y=g(f(x))$ (Fig. 8). This small change makes the more complex network contain the simpler one (the right part of Fig. 7), so a deeper ResNet is at least as good as a shallower one. In addition, the shortcut gives the bottom layers a fast track to larger gradients (through $f(x)$). What's more, when the simpler network $f(x)$ already performs well, the gradients of the added block will be rather small, its parameters will not change significantly, and the network keeps performing well as it gets deeper.

Fig. 7. Non-nested and nested function classes

Fig. 8. Residual block; the weight layers can be dense or convolutional layers

The structure of ResNet is similar to that of GoogLeNet, but it replaces the Inception block with the residual block, sketched below.
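
A minimal sketch of a residual block with convolutional weight layers and batch normalization, in the spirit of Fig. 8 (the optional 1x1 convolution on the shortcut, needed when the output shape differs from the input, is an addition not discussed above):

from torch import nn
from torch.nn import functional as F

class Residual(nn.Module):
    def __init__(self, in_channels, out_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels,
                               kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.Conv2d(out_channels, out_channels,
                               kernel_size=3, padding=1)
        # 1x1 convolution to match the shortcut's shape to the output, if needed
        self.conv3 = (nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                stride=strides) if use_1x1conv else None)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        if self.conv3:
            x = self.conv3(x)
        # add the shortcut before the final ReLU: y = g(x) + x
        return F.relu(y + x)

For example, Residual(3, 3) keeps the input shape unchanged, while Residual(3, 6, use_1x1conv=True, strides=2) halves the height and width and doubles the channels, matching the stage pattern described earlier.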

Fig. 9. ResNet-18

When counting the number of layers, we only count the convolutional layers and the dense layers. For example, ResNet-18 in Fig. 9 has one initial convolutional layer, eight residual blocks with two convolutional layers each, and one final dense layer: 1 + 16 + 1 = 18.