During training, a batch norm layer calculates a mean and a standard deviation for each activation. It subtracts the mean from the activations and divides them by the standard deviation. It then multiplies the normalized activations by one learned parameter and adds a second learned parameter.

What are the mean and standard deviation used for normalization during training calculated on?

Answer

They are calculated only on the current mini-batch, that is, the batch the model is being trained on at that step.
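
As a rough illustration of the training-time computation, here is a minimal sketch in plain Python. The function name batch_norm_train, the eps value of 1e-5, and the choice to divide the variance by the batch size are assumptions made for this example, not details taken from the lecture; eps is a small constant commonly added to avoid dividing by zero.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Hypothetical helper: batch norm forward pass in training mode.

    batch is a list of rows (one row per example, one column per feature);
    gamma and beta hold one learned scale and shift per feature.
    """
    n = len(batch)
    columns = list(zip(*batch))  # regroup values so each tuple is one feature

    # The statistics come only from the examples in this mini-batch.
    means = [sum(col) / n for col in columns]
    variances = [
        sum((x - m) ** 2 for x in col) / n  # assumption: divide by n, not n - 1
        for col, m in zip(columns, means)
    ]

    # Normalize each value, then apply the learned scale (gamma) and shift (beta).
    out = [
        [
            g * (x - m) / math.sqrt(v + eps) + b
            for x, m, v, g, b in zip(row, means, variances, gamma, beta)
        ]
        for row in batch
    ]
    return out, means, variances

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
out, means, variances = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With gamma set to 1 and beta set to 0, each output column has a mean of approximately 0 and a variance of approximately 1, even though the two input features are on very different scales. A different mini-batch would give different means and variances, which is the point of the answer above.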

Relevant part of lecture

Supplementary material

Original paper on BatchNorm