Question 28/30 v3 lecture 10

How does LayerNorm differ from BatchNorm?


In LayerNorm, the mean and standard deviation are computed across all channels (features) of each individual input (an image, for example), rather than across the batch. Because the statistics no longer depend on other samples, we no longer need to keep track of running averages for inference. LayerNorm helps, but it is not as good as BatchNorm; among other things, normalizing each sample on its own removes some information (think of the sunny vs. foggy picture scenario, where the overall brightness statistics are themselves informative). Still, it is what we have to use for RNNs, where batch statistics across time steps are ill-defined.
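The axis difference can be sketched in a few lines of NumPy (the toy data and epsilon value are illustrative, not from the lecture; the learnable scale/shift parameters of both layers are omitted):

```python
import numpy as np

# Toy batch: 4 samples, 3 features (illustrative values).
x = np.array([[1.,  2.,  3.],
              [4.,  5.,  6.],
              [7.,  8.,  9.],
              [10., 11., 12.]])
eps = 1e-5  # small constant for numerical stability

# BatchNorm (training mode): one mean/variance PER FEATURE,
# computed ACROSS THE BATCH (axis=0). At inference time these
# batch statistics are replaced by running averages.
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm: one mean/variance PER SAMPLE, computed ACROSS ITS
# FEATURES (axis=1). No dependence on other samples, hence no
# running averages are needed.
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0))  # each feature is zero-mean over the batch
print(ln.mean(axis=1))  # each sample is zero-mean over its features
```

Note that `ln` gives the same result for a sample regardless of what else is in the batch, which is why LayerNorm behaves identically at batch size 1, while `bn` does not.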

Relevant part of lecture

Supplementary material

- Layer Normalization by Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
- Summary of BatchNorm, LayerNorm, InstanceNorm and GroupNorm