What is a very important aspect that is often overlooked when doing transfer learning? What is one of the main culprits of getting weird results in this scenario?
Answer
BatchNorm! If we freeze entire modules of our model, those layers were trained to normalize inputs with a particular mean and standard deviation, but the new data no longer has those statistics. The way to correct this is, whenever you do partial layer training, to keep fine-tuning the BatchNorm layers on the new data so their statistics (and affine parameters) adapt.
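A minimal PyTorch sketch of this idea, assuming a torchvision ResNet backbone (the helper name `freeze_except_batchnorm` and the 10-class head are illustrative, not from the original answer):

```python
import torch.nn as nn
from torchvision import models

def freeze_except_batchnorm(model: nn.Module) -> None:
    """Freeze all parameters except those belonging to BatchNorm layers."""
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    for module in model.modules():
        if isinstance(module, bn_types):
            # BatchNorm affine weights stay trainable; running stats will
            # also keep updating as long as the module is in train() mode.
            for p in module.parameters():
                p.requires_grad = True
        else:
            # Freeze only this module's own parameters, not its children.
            for p in module.parameters(recurse=False):
                p.requires_grad = False

# Example usage (assumes torchvision >= 0.13 for the `weights` argument).
model = models.resnet18(weights="IMAGENET1K_V1")
freeze_except_batchnorm(model)
# Replace the head for the new task; its parameters are trainable by default.
model.fc = nn.Linear(model.fc.in_features, 10)
```

During training, calling `model.train()` as usual keeps the BatchNorm running statistics updating on the new data, which is exactly the adaptation the answer describes.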