It is a type of regularization used when training RNNs. It is very similar to weight decay, but instead of adding a multiplier times the sum of squares of the weights to the loss, we add a multiplier times the sum of squares of the activations. In other words, we are not trying to shrink the weights, but to shrink the total activations.
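A minimal NumPy sketch of the idea, contrasting the two penalties (the function names, the `alpha` multiplier, and the toy RNN step are illustrative assumptions, not from any particular library):

```python
import numpy as np

def weight_decay_penalty(weights, alpha):
    # Weight decay: alpha times the sum of squared weights.
    return alpha * np.sum(np.square(weights))

def activation_reg_penalty(activations, alpha):
    # Activation regularization: the same form, but applied to the
    # network's activations instead of its weights.
    return alpha * np.sum(np.square(activations))

# Toy RNN step: h_t = tanh(W @ h_prev + U @ x_t)
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
U = rng.normal(size=(4, 3))
h = np.zeros(4)
x = rng.normal(size=3)
h = np.tanh(W @ h + U @ x)

task_loss = 0.0  # stand-in for the real task loss
# The penalty is added to the loss, so gradients push activations toward zero.
total_loss = task_loss + activation_reg_penalty(h, alpha=2.0)
```

In practice the penalty is computed over the hidden states at every timestep and added to the task loss before backpropagation.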