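It cannot be used for online training (batch size of 1): a single example has zero variance across the batch dimension, so the normalized activations carry no information. More generally, whenever the batch size is small, the batch statistics are noisy estimates, and training either fails or becomes unstable. It is also problematic for an RNN: how do you normalize a batch where each sequence can contain a variable number of words and where the weights are tied across time steps?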
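The small-batch problem can be illustrated with a minimal NumPy sketch. It assumes a plain per-feature normalization over the batch axis with no learned scale or shift; the names `batch_norm` and `mean_spread` are hypothetical helpers for illustration, not part of any library.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch axis (axis 0).

    Simplified sketch: no learned scale/shift and no running statistics.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)

# Batch size 1: the batch mean equals the single example, so the
# normalized output is zero for every feature, whatever the input was.
single = rng.normal(size=(1, 4))
out_single = batch_norm(single)

def mean_spread(batch_size, trials=1000):
    """Standard deviation of the per-batch mean across many sampled batches."""
    batches = rng.normal(size=(trials, batch_size))
    return batches.mean(axis=1).std()

# Small batches give much noisier estimates of the mean than large ones.
spread_small = mean_spread(2)
spread_large = mean_spread(256)

print(out_single)
print(spread_small, spread_large)
```

With a batch of one, the output is all zeros, so the layer passes nothing useful forward. With a batch of two, the estimated mean fluctuates far more from batch to batch than with a batch of 256, which is one way to see where the instability comes from.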