Multiple GPUs and Parallelism

Batch size

Batch size is related to sample diversity. When computing the loss of a batch, we take the average over the samples, so the gradient used for each update is an average of the per-sample gradients. What matters is therefore how much independent information the batch contributes, not just how many samples it contains: the more diverse the samples in a batch are, the more informative each update is. For example, if all the samples in a batch are identical, the update is no better than one computed from a single sample.
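The sketch below illustrates this point; it assumes PyTorch and a hypothetical tiny linear model, and simply checks that averaging the loss over a batch of identical copies of one sample yields exactly the same gradient as that single sample.

```python
import torch

# Hypothetical tiny linear model and data, just to illustrate the point above.
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(3)          # a single sample
y = torch.tensor(1.0)       # its label

# Gradient computed from the single sample.
loss_single = (w @ x - y) ** 2
loss_single.backward()
grad_single = w.grad.clone()

# Gradient from a "batch" of 8 identical copies of that sample,
# with the loss averaged over the batch (as in typical training loops).
w.grad.zero_()
X = x.repeat(8, 1)          # batch of identical samples
Y = y.repeat(8)
loss_batch = ((X @ w - Y) ** 2).mean()
loss_batch.backward()

print(torch.allclose(grad_single, w.grad))  # True: the copies add no information
```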

When the total number of samples is fixed, a larger batch size means fewer parameter updates per epoch. Hence, to reach comparable performance in the same number of epochs, we usually have to increase the learning rate when we increase the batch size.
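As a rough illustration, the sketch below applies the commonly cited linear scaling heuristic, growing the learning rate in proportion to the batch size; the base batch size, base learning rate, and dataset size are illustrative assumptions, not recommended settings.

```python
# Minimal sketch of the linear scaling heuristic: when the batch size grows
# by a factor k, scale the learning rate by roughly the same factor so that
# each epoch makes comparable progress despite having fewer updates.
base_batch_size = 128   # assumed reference configuration
base_lr = 0.1
num_samples = 50_000    # e.g. a CIFAR-10-sized training set (assumption)

def scaled_lr(batch_size: int) -> float:
    """Scale the learning rate linearly with the batch size."""
    return base_lr * batch_size / base_batch_size

for bs in (128, 256, 512, 1024):
    updates_per_epoch = num_samples // bs
    print(f"batch={bs:5d}  updates/epoch={updates_per_epoch:4d}  lr={scaled_lr(bs):.3f}")
```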

In theory, a batch size of one sample (pure stochastic gradient descent) can give the best model performance, but it is very time-consuming in practice because the updates cannot be parallelized across a batch and the hardware is poorly utilized.