PyTorch
Installation and import
Since PyTorch is based on the Torch library, the package itself is simply called torch:
```bash
# Installation
pip install torch
```

```python
# Import
import torch
```
Basic concept
The basic data type in PyTorch is also the tensor, which is almost the same as the tensor in TensorFlow. PyTorch provides two high-level features:
- Tensor computing with strong acceleration via GPUs (NumPy only runs on CPUs);
- Deep neural networks built on a tape-based automatic differentiation system.

Both features have much in common with TensorFlow. However, compared with TensorFlow, the API provided by PyTorch is much closer to NumPy's.
Data manipulation
To create a tensor, the operations we use are almost the same as those in NumPy:
```python
import torch
```
Unless otherwise specified, new tensors are stored in main memory and use CPU-based computation. For more on indexing, slicing, operations and broadcasting, see NumPy.
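As a quick reference, a few creation routines that mirror their NumPy counterparts (the shapes and values below are arbitrary):

```python
x = torch.arange(12)                         # 0, 1, ..., 11
X = x.reshape(3, 4)                          # same data, new shape
torch.zeros((2, 3, 4))                       # all zeros
torch.ones((2, 3, 4))                        # all ones
torch.randn(3, 4)                            # samples from a standard normal
torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4]])   # from a nested Python list
```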
Though most operations are similar, there are still some differences. We use torch.cat
to concatenate multiple tensors together:
```python
X = torch.arange(12, dtype=torch.float32).reshape(-1, 4)
Y = torch.cat((X, X), dim=0)   # concatenate along rows; use dim=1 for columns
```
In addition to .reshape, we can also use .view to get an object with a different shape. There are some differences between the two: .view always returns a view of the original tensor and raises an error if the data are not contiguous in memory, while .reshape returns a view when possible and a copy otherwise.
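As a small illustration of this difference (shapes chosen arbitrarily): transposing a tensor makes it non-contiguous, after which .view fails while .reshape silently falls back to copying:

```python
A = torch.arange(6).reshape(2, 3)
B = A.t()              # the transpose is a non-contiguous view of A
B.is_contiguous()      # False
B.reshape(6)           # fine: returns a copy here
# B.view(6)            # would raise a RuntimeError, because B is not contiguous
```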
Saving memory
X = X + Y creates a new object and allocates new memory, while X[:] = X + Y or X += Y performs the operation in place.
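A quick way to verify this is to compare the Python object identity before and after the update (a minimal sketch):

```python
X = torch.ones(2, 3)
Y = torch.ones(2, 3)

before = id(X)
X = X + Y            # a new tensor is allocated
id(X) == before      # False

before = id(X)
X += Y               # in-place update, the original memory is reused
id(X) == before      # True
```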
Conversion to other Python objects
```python
A = X.numpy()            # torch.Tensor -> numpy.ndarray
B = torch.from_numpy(A)  # numpy.ndarray -> torch.Tensor (shares memory with A)
```
Linear algebra functions
- A.T shares memory with A;
- A.clone() returns a new object with the same elements as A;
- A.mean(dim) and A.sum(dim) return the mean or sum along the given dimensions of A. In general, the reduced dimensions disappear from the shape of the output, but we can pass keepdim=True to keep them with size 1;
- A.cumsum(dim) calculates the cumulative sum of the elements of A along the given dimension;
- torch.dot(), torch.mv() and torch.mm() (A @ B is also legal) calculate vector-vector, matrix-vector and matrix-matrix products respectively;
- A * B, or $A\odot B$, is called the Hadamard (elementwise) product.
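A compact sketch of these operations (the shapes below are arbitrary):

```python
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
x = torch.ones(3)
y = torch.ones(3)

A.T                           # transpose, shares memory with A
A.clone()                     # a genuine copy of A
A.sum(dim=1)                  # shape (2,)
A.sum(dim=1, keepdim=True)    # shape (2, 1): the reduced axis is kept with size 1
A.cumsum(dim=0)               # cumulative sum along dim 0
torch.dot(x, y)               # vector-vector product
torch.mv(A, x)                # matrix-vector product
torch.mm(A, A.T)              # matrix-matrix product, same as A @ A.T
A * A                         # Hadamard (elementwise) product
```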
Norms
Norms ($\|x\|$) are often used to measure the length or size of a vector in a vector space (or of a matrix). They are scalars that satisfy:
- Non-negativity;
- Homogeneity;
- Triangle inequality
For vectors, the $\ell_2$ norm measures the (Euclidean) length of a vector:
$$\|x\|_2=\sqrt{\sum\limits_{i=1}^{n}x_i^2}$$

```python
torch.norm(x)
```
The $\ell_1$ norm is also called the Manhattan distance:
$$\|x\|_1=\sum\limits_{i=1}^{n}|x_i|$$

```python
torch.abs(x).sum()
```
For matrices, we often use the Frobenius norm, which is the analogue of the $\ell_2$ norm applied to all entries of the matrix:
$$\|X\|_F=\sqrt{\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n}x_{ij}^2}$$

```python
torch.norm(X)
```
Autograd
See Matrix derivative for more about matrix derivatives.
Unlike TensorFlow, PyTorch constructs the computation graph implicitly, so we can simply use its API without declaring the graph explicitly (as with Autograd in TensorFlow). Whenever we want to compute the derivative with respect to a certain variable, we only need four steps in PyTorch (a complete example follows the list):
- Attach gradients to those variables with respect to which we desire derivatives:

  ```python
  # Create the independent variable and the space to store its gradient
  x = torch.arange(4.0, requires_grad=True)  # or x.requires_grad_(True)
  ```

- Record the computation of the target value (the dependent variable):

  ```python
  y = f(x)  # f is the function we define
  ```

- Execute the backpropagation function:

  ```python
  y.backward()
  ```

- Access the resulting gradient:

  ```python
  print(x.grad)
  ```
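Putting the four steps together on a concrete function (this particular $f$ is just an illustration): for $y = 2\,x^\top x$ the gradient should be $4x$.

```python
x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)    # a scalar dependent variable
y.backward()
print(x.grad)              # tensor([ 0.,  4.,  8., 12.])
print(x.grad == 4 * x)     # tensor([True, True, True, True])
```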
The steps above only work when the dependent variable is a scalar. For non-scalar variables, we sometimes turn them into scalars by summing all elements together, e.g. y.sum().backward(). This works because, in the end, the gradients flowing into a given model parameter are added together anyway. More commonly, we use a row vector $v^\top$ to turn $\vec{y}$ into a scalar:
```python
# this is the same as y.sum().backward(), because torch.dot(torch.ones(len(y)), y) == y.sum()
y.backward(torch.ones(len(y)))
```
Since neural networks are usually trained batch by batch, the gradients in x.grad are accumulated across calls to backward. To reset the gradient buffer, we can call x.grad.zero_().
Detaching computation
If $z=f(x,y)$ and $y=g(x)$, but we only want to focus on the direct influence of $x$ on $z$, we can create a new variable that detaches the connection between $x$ and $y$:
```python
y = g(x)
u = y.detach()   # u has the same values as y, but is treated as a constant
```
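As a concrete sketch (taking $g(x)=x^2$ and $z=u\cdot x$ purely for illustration), the gradient of $z$ with respect to $x$ then treats $u$ as a constant:

```python
x = torch.arange(4.0, requires_grad=True)
y = x * x
u = y.detach()         # cut y out of the computation graph
z = u * x              # z depends on x directly, and on u only as a constant
z.sum().backward()
print(x.grad == u)     # tensor([True, True, True, True]), i.e. dz/dx = u
```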
Updating parameters
The computation graph (a tree) is built implicitly (if requires_grad=True) whenever we operate on the parameters that we want to optimize. Since the parameters are always leaf nodes, we have to detach the update step from the graph; otherwise the whole graph will go wrong.

PyTorch implements a dynamic graph mechanism: the computation graph is constructed during forward propagation and destroyed during backpropagation. More precisely, the graph is destroyed when backward() is called, leaving only the parameters in the leaf nodes.

The mechanism PyTorch provides for this is with torch.no_grad():, which disables gradient tracking inside the block (as if requires_grad=False) and restores it on exit:
```python
with torch.no_grad():   # general usage: nothing inside is recorded in the graph
    ...
```
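For example, a hand-written SGD step typically follows this pattern (params, lr and the function name are placeholders, not part of the text above):

```python
def sgd(params, lr):
    """Minimal manual SGD update; params is a list of tensors with requires_grad=True."""
    with torch.no_grad():                # keep the update out of the computation graph
        for param in params:
            param -= lr * param.grad     # in-place update on the leaf tensor
            param.grad.zero_()           # reset the accumulated gradient
```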
Net constructing
All layers, blocks and networks in PyTorch are subclasses of nn.Module. We can define our own blocks by inheriting from nn.Module and overriding the __init__ and forward functions.
```python
from torch import nn
from torch.nn import functional as F
```
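A minimal sketch of such a custom block (the layer sizes are arbitrary):

```python
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(20, 256)   # hidden layer
        self.out = nn.Linear(256, 10)      # output layer

    def forward(self, X):
        # F.relu is the functional form of the ReLU activation
        return self.out(F.relu(self.hidden(X)))

net = MLP()
X = torch.rand(2, 20)
net(X)   # equivalent to net.forward(X), via nn.Module.__call__
```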
In general, we just need to define the structure of our block in __init__ and compute the output in forward. nn.Module's __call__ dispatches to forward, so net(X) is equal to net.forward(X).
Sequential
nn.Sequential is a built-in subclass of nn.Module. Its working principle is very simple: it just chains blocks together. We can define an equivalent ourselves:
```python
class MySequential(nn.Module):
```
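A sketch of the full definition, consistent with the description below (the layers passed in the last lines are arbitrary):

```python
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            # _modules is an OrderedDict defined in nn.Module; registering
            # blocks here lets PyTorch discover their parameters
            self._modules[str(idx)] = module

    def forward(self, X):
        # run the blocks in the order they were registered
        for block in self._modules.values():
            X = block(X)
        return X

net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
net(torch.rand(2, 20))
```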
MySequential will compute in the order nn.Linear, nn.ReLU, nn.Linear. ._modules is an OrderedDict defined in nn.Module; it stores our blocks in order. All the arguments of Sequential should be subclasses of nn.Module. In PyTorch, an activation is also a layer, even though it has no model parameters.
Parameter management
The type of model parameters in PyTorch is nn.Parameter, a compound object containing the values (a Tensor), the gradients (grad) and extra information. grad is populated when we call .backward() and requires_grad=True.
Parameter visiting
For blocks that define model parameters, we can use .state_dict(), which returns a dictionary, to access the model parameters:
```python
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
print(net[2].state_dict())   # parameters of the last Linear layer
```
Sequential allows us to index into each block like a list. If we just print(net), Python will output the structure of net.
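A few more ways to reach individual parameters (a sketch; the index and attribute names refer to the net defined above):

```python
print(type(net[2].bias))            # <class 'torch.nn.parameter.Parameter'>
print(net[2].bias.data)             # the underlying tensor value
print(net[2].weight.grad is None)   # True before any call to backward()
# iterate over all parameters together with their names
print([(name, param.shape) for name, param in net.named_parameters()])
```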
Parameter initialization
We can define our own function to initialize the model parameters:
```python
def init_normal(m):
```
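A sketch of such a function (the normal-distribution parameters are arbitrary choices):

```python
def init_normal(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, mean=0, std=0.01)   # small random weights
        nn.init.zeros_(m.bias)                        # zero biases

net.apply(init_normal)   # call init_normal on every submodule of net
```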
m is a submodule (a subclass of nn.Module). net.apply asks PyTorch to call init_normal on every block. Only blocks with model parameters will have their parameters initialized (in the net above, only the built-in nn.Linear layers define model parameters).
Since each block is itself a subclass of nn.Module, we can also initialize each block separately:
```python
net[0].apply(init_normal)
```
Direct initialization is also possible, for example:

```python
def my_init(m):
    if type(m) == nn.Linear:
        nn.init.uniform_(m.weight, -10, 10)
        m.weight.data *= m.weight.data.abs() >= 5
```

We must not initialize all weights to the same value. If we do, all the neurons in a layer compute the same thing: they output the same value and their gradients are also the same.
Shared layer
Since instances of classes are mutable objects in Python, we can make two layers share their parameters by passing the same layer object to PyTorch more than once:
```python
shared = nn.Linear(8, 8)
```
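A sketch of the idea (layer sizes are arbitrary): the same shared instance appears twice, so both positions read and update one set of parameters:

```python
shared = nn.Linear(8, 8)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))
net(torch.rand(2, 4))
print(net[2] is net[4])                                 # True: literally the same module
print(net[2].weight.data[0] == net[4].weight.data[0])   # all True: shared parameters
```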
Block with parameters
We can also define parameters for our own blocks. Since backward works for any Tensor in the graph, we don't have to implement backpropagation ourselves; we just need to define our parameters with requires_grad=True (which nn.Parameter sets by default).
```python
class MyLinear(nn.Module):
```
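A sketch of a custom linear layer that defines its own parameters (the in_units/units names are illustrative):

```python
class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        # wrapping a tensor in nn.Parameter registers it as a model parameter
        # (requires_grad=True by default)
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.zeros(units))

    def forward(self, X):
        return torch.matmul(X, self.weight) + self.bias

layer = MyLinear(5, 3)
layer(torch.rand(2, 5))
```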
Saving & loading parameters
We can save a tensor, a tensor list or a dictionary using torch.save
and load them using torch.load
:
```python
x = torch.arange(4)
torch.save(x, 'x-file')   # later restored with x2 = torch.load('x-file')
```
For a model, we usually save its parameters (the state_dict) rather than the whole model:
```python
class MLP(nn.Module):
```
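A sketch of the saving half (the MLP definition and the file name 'mlp.params' are illustrative choices):

```python
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(20, 256)
        self.out = nn.Linear(256, 10)

    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

net = MLP()
torch.save(net.state_dict(), 'mlp.params')   # save only the parameters
```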
When we need to load the model, we should rebuild the same structure and load the parameters:
```python
clone = MLP()
clone.load_state_dict(torch.load('mlp.params'))   # load the parameters saved above
```
Training on GPUs
Tensors are created on CPU by default.
```python
x = torch.tensor([1, 2, 3])
x.device   # device(type='cpu')
```
To create a tensor on a GPU, we must specify the GPU we want to use:
```python
X = torch.ones(2, 3, device=torch.device('cuda'))
```
cuda means GPU in PyTorch. All the NVIDIA GPUs in the machine are organized as an array: cuda (or cuda:0) refers to the first GPU and cuda:1 refers to the second. We can run !nvidia-smi to get information about the NVIDIA GPUs in our machine, and torch.cuda.device_count() to get their number.
If we want to operate on different tensors together, we must make sure that they are stored on the same GPU; otherwise PyTorch throws an exception instead of copying data implicitly, because moving data from CPU to GPU or from one GPU to another is time-consuming.
```python
Y = torch.rand(2, 3, device=torch.device('cuda'))
X + Y   # works because X and Y live on the same GPU
```
Neural networks on GPU
Similarly, we can put the parameters of a neural network on a GPU:
```python
net = nn.Sequential(nn.Linear(3, 1))
net = net.to(device=torch.device('cuda'))
```
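To confirm that both the data and the parameters live on the GPU (a small sketch; it assumes at least one CUDA device is available):

```python
device = torch.device('cuda')
X = torch.rand(2, 3, device=device)
net(X)                              # the forward pass runs on the GPU
print(net[0].weight.data.device)    # cuda:0
```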
Both the parameters and the data should be stored on the same device. Inadvertently moving data from one device to another can significantly degrade performance, so we need to pay attention to it. For example, printing a tensor on the command line or logging it as a NumPy ndarray both move data from the GPU to the CPU.
To enable training models on GPUs, we should install the NVIDIA driver and CUDA from the CUDA Toolkit, and install the corresponding GPU build of PyTorch from PyTorch.