Why is Conv2d applied to FBank features in speech translation?

I found this in speechbrain/recipes/Fisher-Callhome-Spanish/ST/transformer/train.py, but I don't understand why Conv2d is used on the FBank features.

Because I wanted to stay consistent with the ASR recipes. The convolution is there to downsample the FBank features by a factor of 4. You could also follow ESPnet and apply a 1D CNN twice. There are various ways to implement it, but the key point here is that we downsample the FBank features by a factor of 4. Hope this answers your question.
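To make the factor of 4 concrete, here is a rough sketch of the length arithmetic, assuming the downsampling comes from two stacked convolutions with stride 2 each. The kernel size of 3, the zero padding, and the helper names `conv_out_len` and `downsample_len` are assumptions for illustration only, not values taken from the recipe.

```python
def conv_out_len(n, kernel=3, stride=2, padding=0):
    """Output length of a convolution along one axis (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def downsample_len(n_frames, n_layers=2, **conv_kwargs):
    """Length after stacking n_layers identical strided convolutions."""
    for _ in range(n_layers):
        n_frames = conv_out_len(n_frames, **conv_kwargs)
    return n_frames


# Hypothetical input: 1000 FBank frames.
n_frames = 1000
print(n_frames, "->", downsample_len(n_frames))  # 1000 -> 249
```

Each stride-2 layer roughly halves the number of frames, so two of them give about a quarter of the original length. The same arithmetic holds whether the layers are 2D or 1D convolutions, which is why either choice works as long as the overall factor is 4.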