Training using half-precision floating point (fp16) can be up to 3x faster. When training with fp16, are all calculations done using half-precision floats?
Answer
No, we can't do everything in fp16, because fp16 is not accurate enough: it's hard to get good gradients with fp16, since small values often round off to zero. Instead, we do the heavy lifting in fp16 (the forward pass and the backward pass). Everywhere else, we convert tensors to fp32 and perform the operations in full precision, e.g. calculating the loss and updating the weights with the gradients.
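Below is a minimal sketch of this split using PyTorch's automatic mixed precision utilities (`autocast` and `GradScaler`). The model, data, and hyperparameters are made-up placeholders, and the lecture itself may use a different wrapper, so treat this as an illustration of the idea rather than the lecture's code. The `GradScaler` part addresses exactly the round-to-zero problem mentioned above: it scales the loss up before the backward pass so tiny fp16 gradients survive, then unscales before the fp32 weight update.

```python
import torch
from torch import nn

# Hypothetical toy model and data: stand-ins for whatever the lecture actually trains.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
xb = torch.randn(64, 784, device="cuda")
yb = torch.randint(0, 10, (64,), device="cuda")

# GradScaler multiplies the loss by a large factor before backward so that tiny
# gradients don't round off to zero in fp16, then unscales them before the update.
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    optimizer.zero_grad()
    # autocast runs the forward pass (and the matching backward ops) in fp16 where
    # it is safe; precision-sensitive ops such as the loss reduction stay in fp32.
    with torch.cuda.amp.autocast():
        pred = model(xb)
        loss = loss_fn(pred, yb)
    scaler.scale(loss).backward()   # backward pass on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then fp32 weight update
    scaler.update()                 # adjusts the scale factor for the next step
```

Note that the weights themselves stay in fp32 here; only the activations and the heavy matrix multiplies run in fp16, which matches the "heavy lifting in fp16, everything else in fp32" description.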
Relevant part of the lecture
Supplementary material
Later in the lecture: fp16 sometimes trains a little bit better than fp32; maybe it has some regularizing effect? Generally, the results are very close.