As demonstrated in research, what is the trick that enables training a 10,000-layer deep neural network?
Answer
initialization
Relevant part of lecture
Supplementary material
We take these things for granted, but many relatively recent inventions are what make training modern architectures possible: better optimizers, activation functions, and batch normalization, among others. Yet there is a single technique without which training a 10,000-layer deep neural network would be impossible, and without which even much shallower architectures would be considerably harder to train, all the other goodies notwithstanding. That technique is initialization.
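To make this concrete, here is a minimal NumPy sketch contrasting a poorly scaled Gaussian initialization with an orthogonal one, the family of schemes behind the 10,000-layer result. The setup is an assumption for illustration: a plain tanh MLP, with width, depth, and the naive scale of 0.01 chosen arbitrarily, not the exact configuration from the paper.

```python
import numpy as np

def orthogonal(n, gain=1.0, rng=None):
    """Sample an n x n orthogonal matrix (QR of a Gaussian, sign-corrected)."""
    rng = np.random.default_rng() if rng is None else rng
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Multiplying columns by sign(diag(r)) makes the sample Haar-uniform.
    return gain * q * np.sign(np.diag(r))

def forward(x, weights):
    """Run x through a deep tanh MLP with the given weight matrices."""
    for w in weights:
        x = np.tanh(x @ w)
    return x

rng = np.random.default_rng(0)
width, depth = 256, 100
x = rng.standard_normal((32, width))

# Naive init: i.i.d. Gaussians with a poorly chosen scale.
naive = [0.01 * rng.standard_normal((width, width)) for _ in range(depth)]
# Orthogonal init: each layer is an exact isometry before the nonlinearity.
ortho = [orthogonal(width, rng=rng) for _ in range(depth)]

print("naive init, output std:", forward(x, naive).std())  # ~0: signal has died
print("ortho init, output std:", forward(x, ortho).std())  # signal still propagates
```

The design choice matters because any mismatch in scale compounds multiplicatively with depth: a per-layer shrink or growth factor even slightly away from 1 drives the forward signal (and, by the same argument, the gradient) to zero or infinity over thousands of layers. Orthogonal matrices preserve norms exactly, which is roughly the property the 10,000-layer experiments exploit.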