XResNet is based on the Bag of Tricks paper. What are the main improvements that it incorporates?
Answer
The improvements that XResNet adds over ResNet are:
Modified stem (three 3x3 convs instead of one 7x7 conv)
The last BatchNorm layer in each residual block has its weight initialized to 0, while the other BatchNorm layers are initialized to 1 (this gives us a great starting point for training our model: the residual branch initially adds nothing to the identity path, so each block starts out as an identity function)
If we do downsampling, the identity path consists of a 2x2 AvgPool followed by a 1x1 conv
In the residual path, the stride of 2 is moved to the 3x3 conv (from the 1x1 conv, where it was throwing out 75% of the information)
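The 75% figure can be checked with a little arithmetic. The sketch below is only an illustration, not code from the lecture or from fastai: the helper name coverage and the 8x8 grid size are assumptions made for this example. It counts which input pixels are read by at least one kernel window for a given kernel size, stride and padding.

```python
def coverage(size, kernel, stride, padding):
    """Fraction of pixels on a size x size input that at least one kernel window reads.

    Hypothetical helper for illustration; it only tracks positions, not values.
    """
    # Standard output-size formula for a convolution or pooling layer.
    out = (size + 2 * padding - kernel) // stride + 1
    touched = set()
    for i in range(out):
        for j in range(out):
            for di in range(kernel):
                for dj in range(kernel):
                    r = i * stride - padding + di
                    c = j * stride - padding + dj
                    # Positions inside the padding border are not real input pixels.
                    if 0 <= r < size and 0 <= c < size:
                        touched.add((r, c))
    return len(touched) / (size * size)


# 1x1 conv with stride 2: only every other row and column is read.
print(coverage(8, kernel=1, stride=2, padding=0))  # 0.25
# 3x3 conv with stride 2 and padding 1: neighbouring windows overlap, nothing is skipped.
print(coverage(8, kernel=3, stride=2, padding=1))  # 1.0
# 2x2 average pool with stride 2 (the identity path): windows tile the input exactly.
print(coverage(8, kernel=2, stride=2, padding=0))  # 1.0
```

A 1x1 kernel with stride 2 reads one pixel out of every 2x2 patch, so 3 out of 4 activations never influence the output. Moving the stride to the 3x3 conv, or putting a 2x2 AvgPool in front of the 1x1 conv on the identity path, makes every input pixel contribute.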
Relevant part of lecture
Supplementary material
Later in the lecture: fp16 sometimes trains a little bit better than fp32. Maybe it has some regularizing effect? Generally, the results are very close together.