PyTorch

Installation and import

Since PyTorch is based on the Torch library, the package is actually named torch:

# Installation
pip install torch
# or, with conda (the conda package is named pytorch)
conda install pytorch -c pytorch
# import
import torch

Basic concept

The basic data type in PyTorch is also the tensor, which is almost the same as the tensor in TensorFlow. PyTorch provides two high-level features:

  • Tensor computing with strong acceleration via GPUs (NumPy only runs on CPUs);
  • Deep neural networks built on a tape-based automatic differentiation system.

Both features have much in common with TensorFlow. However, compared to TensorFlow, the APIs provided by PyTorch are closer to NumPy's.

Data manipulation

To create a tensor, the operations we use are almost the same as in NumPy:

import torch
x = torch.arange(12.) # float32
x = torch.arange(12) # int64
x = torch.zeros((2, 3, 4))
x = torch.ones((2, 3, 4))
x = torch.randn(3, 4) # random elements drawn from a standard normal distribution
x = torch.tensor([[1, 2], [2, 3]])

Unless otherwise specified, new tensors are stored in main memory and designated for CPU-based computation. Indexing, slicing, elementwise operations and broadcasting work just as they do in NumPy (see the NumPy notes).
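A quick sketch of these NumPy-style operations on torch tensors (all names defined here):

x = torch.arange(12.).reshape(3, 4)
x[-1]                              # last row
x[1:3]                             # rows 1 and 2
x[0, 0] = 42                       # element assignment
x + x, x * x, x.exp()              # elementwise operations
a = torch.arange(3).reshape(3, 1)
b = torch.arange(2).reshape(1, 2)
a + b                              # broadcasting to shape (3, 2)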

Though most operations are similar, there are still some differences. We use torch.cat to concatenate multiple tensors together:

X = torch.arange(12, dtype=torch.float32).reshape(-1, 4)
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
torch.cat((X, Y), dim=0)
'''
We get:
tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [ 2.,  1.,  4.,  3.],
        [ 1.,  2.,  3.,  4.],
        [ 4.,  3.,  2.,  1.]])
'''

In addition to .reshape, we can also use .view to get an object with a different shape. There are some differences between the two operations: .view always returns a view of the original tensor and requires the input to be contiguous in memory, while .reshape may return a copy; .reshape only returns a view when the input is contiguous.
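A small sketch illustrating the difference: transposing makes a tensor non-contiguous, so .view fails while .reshape falls back to copying.

x = torch.arange(6).reshape(2, 3)
t = x.t()                  # the transpose shares memory but is no longer contiguous
print(t.is_contiguous())   # False
t.reshape(-1)              # works: returns a copy because a view is impossible
# t.view(-1)               # would raise a RuntimeError for the same reason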

Saving memory

X = X + Y creates a new object and allocates new memory, while X[:] = X + Y or X += Y performs the operation in place.
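We can check this with Python's id() (a minimal sketch, reusing X and Y from above):

before = id(X)
X = X + Y                # allocates a new tensor
print(id(X) == before)   # False

before = id(X)
X += Y                   # in-place; X[:] = X + Y behaves the same way
print(id(X) == before)   # True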

Conversion to other Python objects

A = X.numpy() # torch.Tensor->numpy.ndarray
B = torch.from_numpy(A) # numpy.ndarray->torch.Tensor
b = torch.tensor([2.2]).item() # size-1 tensor->python scalar
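On the CPU, the converted objects share memory with the original tensor, so in-place changes propagate both ways. A quick check, reusing A and X from above:

A[0, 0] = 100.0   # modify the ndarray in place
print(X[0, 0])    # tensor(100.), because A shares X's memory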

Linear algebra functions

  • A.T shares memory with A;
  • A.clone() returns a new object with the same elements as A;
  • A.mean(axis) and A.sum(axis) return the mean or sum of A along the given axes. In general, those axes disappear from the shape of the output, but passing keepdim=True keeps them with size 1;
  • A.cumsum(axis) calculates the cumulative sum of the elements of A along the given axis;
  • torch.dot(), torch.mv(), torch.mm() (A @ B is also legal) calculate vector-vector, matrix-vector and matrix-matrix products respectively;

A * B, i.e. $A\odot B$, is called the Hadamard (elementwise) product.
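A short sketch of these operations on small tensors (all names defined here):

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()
x = torch.ones(3)
y = torch.arange(3, dtype=torch.float32)

A.sum(dim=1)                 # shape (2,): the column axis disappears
A.sum(dim=1, keepdim=True)   # shape (2, 1): the column axis is kept with size 1
A.cumsum(dim=0)              # cumulative sum down the rows
torch.dot(x, y)              # vector-vector product
torch.mv(A, x)               # matrix-vector product
torch.mm(A, B.T)             # matrix-matrix product, same as A @ B.T
A * B                        # Hadamard (elementwise) product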

Norms

Norms ($\|x\|$) are often used to measure the length or size of a vector in a vector space (or of a matrix). They are scalars that satisfy:

  • Non-negativity;
  • Homogeneity;
  • Triangle inequality

For vectors, the $\ell_2$ norm measures the (Euclidean) length of a vector:

$$\|x\|_2=\sqrt{\sum\limits_{i=1}^{n}x_i^2}$$

torch.norm(x)

The $\ell_1$ norm is also called the Manhattan distance:

$$\|x\|_1=\sum\limits_{i=1}^{n}|x_i|$$

torch.abs(x).sum()

For matrices, we often use the Frobenius norm, which is the $\ell_2$ norm of the matrix flattened into a vector:

torch.norm(X)

Autograd

See Matrix derivative for more about matrix derivatives.

Unlike TensorFlow, PyTorch constructs the computation graph implicitly, which lets us simply use its API without declaring the recording scope explicitly (the way tf.GradientTape does in TensorFlow). Whenever we want to compute the derivative with respect to a certain variable, we only need 4 steps in PyTorch:

  1. Attach gradients to those variables with respect to which we desire derivatives:
    # Create the independent variable and the space to store derivatives
    x = torch.arange(4.0, requires_grad=True) # or x.requires_grad_(True)
  2. Record the computation of the target value (the dependent variable):
    y = f(x) # f is the function we define
  3. Execute the back-propagation function:
    y.backward()
  4. Access the resulting gradient:
    print(x.grad)

The steps above only work when the dependent variable is a scalar. For non-scalar variables, we sometimes turn them into scalars by summing all elements, as in y.sum().backward(). This works because the gradients flowing into a given model parameter are summed anyway. More commonly, we use a row vector $v^T$ to turn $\vec{y}$ into a scalar:

# this is the same as: y.sum().backward(), because torch.dot(torch.ones(len(y)), y) = y.sum()
y.backward(gradient=torch.ones(len(y)))

Since neural networks are usually computed batch by batch, the gradients in x.grad accumulate across backward calls. To reset the gradient buffer, we call x.grad.zero_().
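Putting the four steps together, a minimal end-to-end sketch using the simple function $y=2x^Tx$:

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)    # record the computation
y.backward()               # back propagation
print(x.grad)              # tensor([ 0.,  4.,  8., 12.]), i.e. 4 * x

x.grad.zero_()             # reset the buffer before the next backward call
y = x.sum()
y.backward()
print(x.grad)              # tensor([1., 1., 1., 1.])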

Detaching computation

If $z=f(x,y)$ and $y=g(x)$, but we only want to focus on the direct influence of $x$ on $z$, we can create a new variable that is detached from the computation graph, cutting the path from $x$ through $y$:

y = g(x)
u = y.detach() # remove y from the computation graph of z
z = x * u # u is treated as a constant: no longer a function of x (y still is)

z.sum().backward()
x.grad == u # all True, since dz/dx = u

Updating parameters

The computation graph (a tree) is built implicitly (if requires_grad=True) whenever we operate on the parameters that we want to optimize. Since the parameters are always leaf nodes, we have to perform the update detached from the graph; otherwise the update itself is recorded and the whole graph goes wrong.

PyTorch implements a dynamic graph mechanism. Specifically, the computation graph is constructed during forward propagation and destroyed during back propagation. More precisely, the graph is freed when backward() is called (unless retain_graph=True), leaving only the leaf nodes, i.e. the parameters.

The tool PyTorch provides is the with torch.no_grad(): context, which disables gradient tracking on entry and restores it on exit:

# General usage
with torch.no_grad():
    for param in params:
        param -= lr * param.grad / batch_size
        param.grad.zero_()

Constructing networks

All layers, blocks and nets in PyTorch are subclasses of nn.Module. We can define our own blocks by inheriting from nn.Module and overriding the __init__ and forward functions.

import torch
from torch import nn
from torch.nn import functional as F

class MLP(nn.Module):
    def __init__(self):
        # we must call Module's __init__ so that we inherit its machinery (e.g. parameter tracking)
        super().__init__()
        self.hidden = nn.Linear(20, 256) # a hidden layer
        self.out = nn.Linear(256, 10)    # output layer

    # forward propagation: take X and produce the output
    def forward(self, X):
        # F provides stateless functions such as relu
        return self.out(F.relu(self.hidden(X)))

net = MLP()
net(X) # X should have shape (batch_size, 20)

In general, we just need to define the structure of our block in __init__ and compute the output in forward. nn.Module's __call__ dispatches to forward, so net(X) is essentially equal to net.forward(X).

Sequential

nn.Sequential is a built-in subclass of nn.Module. Its working principle is very simple: it just chains blocks together. We can define a similar class ourselves:

class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, block in enumerate(args):
            # _modules expects string keys
            self._modules[str(idx)] = block

    def forward(self, X):
        for block in self._modules.values():
            X = block(X)
        return X

net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
net(X)

MySequential will compute in the order of nn.Linear, nn.ReLU and nn.Linear.

._modules is an OrderedDict defined in nn.Module; it stores our blocks in order. All the arguments of Sequential should be nn.Module instances.

In PyTorch, an activation function is also a layer, even though it has no model parameters.

Parameter management

The type of model parameters in PyTorch is nn.Parameter, a compound object containing the value (a Tensor), the gradient (grad) and extra information. grad is only populated once we call .backward() with requires_grad=True.
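For example, a small sketch with a fresh nn.Linear layer:

layer = nn.Linear(4, 8)
print(type(layer.weight))          # <class 'torch.nn.parameter.Parameter'>
print(layer.weight.requires_grad)  # True
print(layer.weight.grad is None)   # True: no gradient until we call .backward()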

Parameter access

For blocks that define model parameters, we can use .state_dict(), which returns a dictionary, to access the model parameters:

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
print(net[2].state_dict())
'''
OrderedDict([('weight', tensor([[ 0.3016, -0.1901, -0.1991, -0.1220, 0.1121, -0.1424, -0.3060, 0.3400]])), ('bias', tensor([-0.0291]))])
'''
print(type(net[2].bias))
print(net[2].bias)
print(net[2].bias.data)
'''
<class 'torch.nn.parameter.Parameter'>
Parameter containing:
tensor([-0.0291], requires_grad=True)
tensor([-0.0291])
'''

Sequential allows us to index each block like a list. If we just print(net), Python outputs the structure of net.
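We can also iterate over all parameters at once; a short sketch using named_parameters() on the net defined above:

print([(name, param.shape) for name, param in net.named_parameters()])
# [('0.weight', torch.Size([8, 4])), ('0.bias', torch.Size([8])),
#  ('2.weight', torch.Size([1, 8])), ('2.bias', torch.Size([1]))]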

Parameter initialization

We can define our own function to initialize the model parameters:

def init_normal(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, mean=0, std=0.01)
        nn.init.zeros_(m.bias)

net.apply(init_normal)

m is a submodule (a subclass of nn.Module). net.apply asks PyTorch to call init_normal on every block. Only blocks that own model parameters are affected (in this net, only the built-in nn.Linear layers define parameters).

Since each block is itself a subclass of nn.Module, we can also initialize individual blocks separately:

net[0].apply(init_normal)

Direct initialization is also possible, like:

def my_init(m):
    if type(m) == nn.Linear:
        nn.init.uniform_(m.weight, -10, 10)
        m.weight.data *= m.weight.data.abs() >= 5

We must not initialize all weights to the same value. If we do, all the neurons in a layer compute the same thing: they output the same value, their gradients are identical, and they never break symmetry.

Shared layer

Since module instances are ordinary (mutable) Python objects, we can make two layers share their parameters by passing the same object to PyTorch twice:

shared = nn.Linear(8, 8)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))
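A quick check (a sketch) that the two positions really refer to one set of parameters, so a change made through one index is visible through the other:

net[2].weight.data[0, 0] = 100.0
# net[2] and net[4] are the same object
print(net[2].weight.data[0, 0] == net[4].weight.data[0, 0])   # tensor(True)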

Block with parameters

We can also define parameters for our own blocks. Since autograd works for any Tensor, we don't have to implement backward ourselves; we just need to wrap our parameters in nn.Parameter (which sets requires_grad=True by default).

class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units,))

    def forward(self, X):
        # use the parameters directly (not .data) so that autograd tracks them
        linear = torch.matmul(X, self.weight) + self.bias
        return F.relu(linear)
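Usage is the same as for built-in layers, e.g.:

layer = MyLinear(5, 3)
print(layer.weight.shape)   # torch.Size([5, 3])
layer(torch.rand(2, 5))     # forward pass on a random batch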

Saving & loading parameters

We can save a tensor, a list of tensors or a dictionary of tensors using torch.save and load them back with torch.load:

x = torch.arange(4)
y = torch.arange(5)
z = {'x': x, 'y': y}
torch.save(x, 'x-file')
torch.save([x, y], 'xy-file')
torch.save(z, 'z-dict')

x1 = torch.load('x-file')
xx, yy = torch.load('xy-file')
zz = torch.load('z-dict')

x1 == x, xx == x, yy == y, zz['x'] == x, zz['y'] == y
'''
(tensor([True, True, True, True]),
tensor([True, True, True, True]),
tensor([True, True, True, True, True]),
tensor([True, True, True, True]),
tensor([True, True, True, True, True]))
'''

For a model, the usual practice in PyTorch is to save its parameters (the state_dict) rather than the whole model:

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(20, 256)
        self.output = nn.Linear(256, 10)

    def forward(self, x):
        return self.output(F.relu(self.hidden(x)))

net = MLP()
torch.save(net.state_dict(), 'mlp.params')

When we need to load the model, we should rebuild the same structure and load the parameters:

clone = MLP()
clone.load_state_dict(torch.load('mlp.params'))
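To confirm the parameters were restored correctly, we can compare the outputs of both models on the same input (a sketch, assuming X is a batch with 20 features, e.g. torch.rand(2, 20)):

Y = net(X)
Y_clone = clone(X)
print(Y_clone == Y)   # all True: the restored parameters give identical outputs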

Training on GPUs

Tensors are created on the CPU by default.

x = torch.tensor([1, 2, 3])
x.device
'''
device(type='cpu')
'''

To create a tensor on a GPU, we must specify which device to use:

X = torch.ones(2, 3, device=torch.device('cuda'))
X.device
'''
device(type='cuda', index=0)
'''

In PyTorch, cuda refers to the GPU. All the NVIDIA GPUs in the machine are organized into an array: cuda or cuda:0 denotes the first GPU and cuda:1 the second. We can run !nvidia-smi to inspect the NVIDIA GPUs in our machine and torch.cuda.device_count() to get their number.
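A small hypothetical helper (not part of PyTorch) that falls back to the CPU when the requested GPU does not exist:

def try_gpu(i=0):
    # return cuda:i if it exists, otherwise the CPU
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')

X = torch.ones(2, 3, device=try_gpu())   # lands on cuda:0 if available, else on the CPU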

If we want to operate on several tensors together, we must make sure they are stored on the same device; otherwise PyTorch throws an exception instead of copying data implicitly, because moving data between CPU and GPU, or between two GPUs, is time-consuming.

Y = torch.rand(2, 3, device=torch.device('cuda'))
Z = torch.rand(2, 3) # create in CPU
X = Z.cuda(0) # create a new tensor in GPU 0
print(X)
print(Z)
'''
tensor([[0.3206, 0.4276, 0.8653],
[0.3276, 0.4867, 0.1320]], device='cuda:0')
tensor([[0.3206, 0.4276, 0.8653],
[0.3276, 0.4867, 0.1320]])
'''
# now we can add Y and X
X + Y
'''
tensor([[0.8401, 0.6486, 1.8582],
[1.1932, 0.8063, 1.1152]], device='cuda:0')
'''
# if we add Y and Z
Y + Z
'''
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
'''

Neural networks on the GPU

Similarly, we can put the parameters of a neural network on the GPU:

net = nn.Sequential(nn.Linear(3, 1))
net = net.to(device=torch.device('cuda'))
net(X)
'''
tensor([[-0.1712],
[ 0.1713]], device='cuda:0', grad_fn=<AddmmBackward0>)
'''

Both parameters and data should be stored on the same device. Inadvertently moving data from one device to another can significantly degrade performance, so we need to pay attention to it. For example, printing tensors on the command line or logging them as NumPy ndarrays both move data from the GPU to the CPU.

To train models on GPUs, we should install CUDA and the NVIDIA driver from the CUDA Toolkit, and install the corresponding GPU version of PyTorch from the PyTorch site.