I finally took a look at the fbank issue. It isn't actually a real issue: it follows naturally from the limited resolution of the FFT. Let me try to explain it with an example.
When we compute the STFT, the frequency resolution depends on the number of FFT points. By default we use nfft=400, because it corresponds to the standard window length of 25 ms at fs=16000. With these settings, we have to work on a grid of points corresponding to the following frequencies (from 0 to 8 kHz):
tensor([   0.,   40.,   80.,  120.,  160.,  200.,  ..., 7920., 7960., 8000.])
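For reference, this grid can be reproduced with a few lines of PyTorch. This is just a minimal sketch (the variable names are illustrative, not the ones used in the library):

```python
import torch

# One-sided STFT frequency grid for nfft=400, fs=16000.
# Bin spacing is fs / nfft = 40 Hz, so the grid runs 0, 40, 80, ..., 8000 Hz.
sample_rate = 16000
n_fft = 400
freq_grid = torch.linspace(0, sample_rate // 2, n_fft // 2 + 1)
print(freq_grid)  # tensor([   0.,   40.,   80., ..., 7960., 8000.])
```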
Let’s now assume that, after all the computations, we have to place the maximum of one triangular filter at f_central = 89.9678 Hz. When you compute the filter values at the discrete points of the frequency grid, you get something like:
0.0000, 0.4482, 0.9064, 0.6554, 0.2072, 0.0000, 0.0000,
The true maximum of 1.0000 lies somewhere between the third and the fourth frequency points, but you cannot see it with the limited resolution of the FFT.
If you increase the FFT resolution (see the sketch below), you indeed get values whose maximum is closer to 1:
..., 0.7878, 0.8316, 0.8754, 0.9191, 0.9629, 0.9933, 0.9496, 0.9058, 0.8620, 0.8182, 0.7745, ...
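Here is a minimal sketch of what I mean. This is not the actual SpeechBrain filterbank code: the filter below is a simple symmetric triangle, and the `band` value and the finer `n_fft` are illustrative assumptions, so the numbers will not match the ones above exactly, but the effect is the same.

```python
import torch

# Sketch: a symmetric triangular filter centred at f_central, evaluated on two
# frequency grids of different resolution. The band (half-width) is an assumption;
# in the real filterbank the edges come from the mel scale.
def triangular_filter(freqs, f_central, band):
    left = (freqs - (f_central - band)) / band    # rising slope
    right = ((f_central + band) - freqs) / band   # falling slope
    return torch.clamp(torch.min(left, right), min=0.0)

sample_rate = 16000
f_central = 89.9678
band = 90.0  # hypothetical half-width in Hz

# Coarse grid: nfft = 400 -> 40 Hz spacing. The peak falls between two bins,
# so the sampled maximum stays clearly below 1.0.
coarse = torch.linspace(0, sample_rate // 2, 400 // 2 + 1)
print(triangular_filter(coarse, f_central, band).max())

# Finer grid: nfft = 3200 -> 5 Hz spacing. The sampled maximum gets much closer to 1.0.
fine = torch.linspace(0, sample_rate // 2, 3200 // 2 + 1)
print(triangular_filter(fine, f_central, band).max())
```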
One alternative approach could be to force each central frequency of the filter to coincide exactly with one of the frequencies available in the grid (e.g., if the real central frequency is 89.96 Hz, we might place it at 80 Hz, as sketched below). This is not ideal either, because the central frequencies of the filters would then depend on the FFT resolution. Moreover, the allocated filterbank would deviate from the one expected from the mel scale. Both solutions lead to a non-ideal fbank matrix due to the limited resolution of the FFT.
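For completeness, snapping the central frequency to the grid would look roughly like this (again just a sketch, not something implemented in the code):

```python
# Sketch of the alternative: snap each central frequency to the nearest bin of the
# coarse grid. The sampled peak then reaches exactly 1.0, but the centres now depend
# on nfft and drift away from their mel-scale positions.
sample_rate, n_fft = 16000, 400
bin_width = sample_rate / n_fft        # 40 Hz
f_central = 89.9678
f_snapped = round(f_central / bin_width) * bin_width
print(f_snapped)                       # 80.0 instead of 89.9678
```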