Question 8/10 v3 lecture 8

As demonstrated in research, what is the trick that enables training a 10,000-layer deep neural network?



Relevant part of lecture

Supplementary material

We take things for granted as they are, but many recent inventions make the training of modern architectures possible: optimizers, activation functions, and batch norm. Yet there is a single technique without which training a 10,000-layer deep neural network would not be possible, and whose absence would make even shallower architectures much harder to train, with all the other goodies at hand. And that is initialization.
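A minimal sketch of why initialization matters (not from the lecture itself): propagate a random input through a deep stack of ReLU layers and watch what happens to the activation scale under a naive fixed scale versus Kaiming (He) initialization, which sets the weight standard deviation to sqrt(2/fan_in). The depth and width below are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_std(scale_fn, depth=50, width=256):
    """Push a random vector through `depth` ReLU layers whose weights
    are scaled by `scale_fn(fan_in)`, and return the final activation std."""
    x = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width)
        x = np.maximum(W @ x, 0.0)  # ReLU
    return x.std()

# Naive small fixed scale: the signal shrinks at every layer and vanishes.
naive = forward_std(lambda fan_in: 0.01)

# Kaiming init: std = sqrt(2 / fan_in) keeps the variance roughly
# constant through ReLU layers, so the signal survives at depth.
kaiming = forward_std(lambda fan_in: np.sqrt(2.0 / fan_in))

print(f"naive init final std:   {naive:.3e}")   # collapses toward 0
print(f"kaiming init final std: {kaiming:.3e}")  # stays on the order of 1
```

With a bad scale the forward signal (and, symmetrically, the backward gradient) shrinks or grows geometrically with depth, which is exactly what makes very deep networks untrainable without a principled initialization.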

Links to relevant papers:

Delving Deep into Rectifiers (the paper introducing both Kaiming initialization and PReLU!)
Fixup Initialization (how do you train a 10,000-layer beast?!)
Understanding the difficulty of training deep feedforward neural networks