If you have a single channel image and are running a 3x3 conv on it, why is opting for 8 channels problematic?

Answer

8 channels are too many - effectively we go from 9 numbers describing a patch to 8, the decrease is too small (we are not doing any useful computation, all we are doing is reordering the numbers)

Relevant part of lecture

supplementary material

An interesting discussion on this subject arose on Twitter. Some of the highlights are:

A conv layer is meant to act as an encoder
Very interesting work in analyzing the performance of the stem has been carried out in the Bag of Tricks for Image Classification paper.
It is interesting to analyze the network of some classical architectures through this perspective.