How does LayerNorm differ from BatchNorm?

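LayerNorm takes the mean and standard deviation across all channels of each individual input (a single image, for example) rather than across the batch. Because the statistics never depend on the other items in the batch, we no longer keep track of running averages. LayerNorm helps, but it is not as good as BatchNorm: among other things, it removes some information. Think of the sunny vs. foggy picture scenario, where normalizing each picture by its own statistics hides the overall difference between the two. It is, however, what we have to use for RNNs.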
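
To make the difference concrete, here is a minimal sketch in plain Python, not a reference implementation. The function names `layer_norm` and `batch_norm`, the constant `EPS`, and the toy data are all assumptions made for illustration. The learnable scale and shift parameters and BatchNorm's running averages are left out. Each "image" is a list of channels, and each channel is a flat list of pixel values.

```python
from math import sqrt
from statistics import fmean, pvariance

EPS = 1e-5  # assumed small constant for numerical stability


def layer_norm(batch):
    """Normalize each sample with its own mean/variance over all channels."""
    out = []
    for sample in batch:
        values = [v for channel in sample for v in channel]
        mean, var = fmean(values), pvariance(values)
        out.append([[(v - mean) / sqrt(var + EPS) for v in channel]
                    for channel in sample])
    return out


def batch_norm(batch):
    """Normalize each channel with its mean/variance over the whole batch.

    Training-mode statistics only; running averages are omitted here.
    """
    stats = []
    for c in range(len(batch[0])):
        values = [v for sample in batch for v in sample[c]]
        stats.append((fmean(values), pvariance(values)))
    return [[[(v - stats[c][0]) / sqrt(stats[c][1] + EPS) for v in channel]
             for c, channel in enumerate(sample)]
            for sample in batch]


# Two toy "images" (2 channels x 2 pixels): same content, second one much brighter.
dark = [[0.1, 0.2], [0.3, 0.4]]
bright = [[10.1, 10.2], [10.3, 10.4]]  # dark + 10
batch = [dark, bright]

ln = layer_norm(batch)
bn = batch_norm(batch)

print("LayerNorm dark  :", ln[0])
print("LayerNorm bright:", ln[1])
print("BatchNorm dark  :", bn[0])
print("BatchNorm bright:", bn[1])
```

After `layer_norm`, the dark and bright images come out essentially identical, so the brightness difference between them is gone. After `batch_norm`, they stay clearly apart, because both are measured against the same per-channel batch statistics. Note also that `layer_norm` needs nothing beyond the sample itself, which is why no running averages are required at inference time.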