Filterbank non-constant amplitudes

Plotting the triangular filterbank, I noticed that the amplitudes at the central frequencies are not
constant. So even with a 50% overlap, I think the filters don’t meet the COLA constraint, and as a result a signal reconstructed from the framed segments will be distorted.

Is that ok? What do you think?

This is ultra weird, as we checked this part many, many times. Could you share the code that produced this?

import torch
import matplotlib.pyplot as plt
from speechbrain.dataio.dataio import read_audio
from speechbrain.processing.features import STFT

from speechbrain.processing.features import spectral_magnitude
from speechbrain.processing.features import Filterbank

# %%

signal, _ = read_audio('spk1_snt1.wav') 
signal = signal.unsqueeze(0) # [batch, time]

compute_STFT = STFT(sample_rate=16000, win_length=25, hop_length=10, n_fft=400)
signal_STFT = compute_STFT(signal)

compute_fbanks = Filterbank(n_mels=20)
# Reuse the STFT computed above; do not rebind the imported STFT class name
mag = spectral_magnitude(signal_STFT)

fbanks, fb_mat = compute_fbanks(mag)
plt.figure(figsize=(8, 4), dpi=100)

I edited the Filterbank forward function to return the fbank_matrix as well,
i.e., in /speechbrain/processing/

return fbanks, fbank_matrix

Trying the Gaussian FB,
the first filter at DC starts at a value that is not zero.

compute_fbanks_gauss = Filterbank(n_mels=10, n_fft=1024, filter_shape="gaussian")

The more FFT points used in the calculation of the STFT, the smoother the curves at lower frequencies get. However, the Gaussian filter starts at tensor(0.1353) instead of 0.
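A possible reading of that 0.1353, offered as an assumption rather than something confirmed from the SpeechBrain source: the value matches exp(-2), which is what a Gaussian of the form exp(-0.5 * ((f - f_central) / sigma)^2) gives two standard deviations away from its center. The sketch below uses a hypothetical `gaussian_filter` helper, a hypothetical `smooth_factor` of 2, and made-up center and band values, only to show that such a filter never reaches zero at DC.

```python
import math


def gaussian_filter(freq_hz, f_central_hz, band_hz, smooth_factor=2.0):
    """Value of a Gaussian-shaped filter at freq_hz.

    The functional form and smooth_factor are assumptions for illustration,
    not taken from the library code.
    """
    sigma = band_hz / smooth_factor
    return math.exp(-0.5 * ((freq_hz - f_central_hz) / sigma) ** 2)


# Hypothetical first filter: centered exactly one bandwidth above DC.
f_central_hz = 100.0
band_hz = 100.0

value_at_dc = gaussian_filter(0.0, f_central_hz, band_hz)
value_at_center = gaussian_filter(f_central_hz, f_central_hz, band_hz)

print(round(value_at_dc, 4))  # exp(-2), i.e. 0.1353
print(value_at_center)        # 1.0 at the center
```

Unlike a triangle, a Gaussian only decays towards zero, so some nonzero value at DC is expected whatever the exact parameters are.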


Ok, this looks weird; we are looking into it. Please do not hesitate to investigate a bit and share your findings. It is hard to keep up with all the questions :frowning:

I finally took a look at the fbank issue. It isn’t actually a real issue: it follows naturally from the limited resolution of the FFT. Let me try to explain it with an example.

When we compute the STFT, we have a certain frequency resolution that depends on the number of FFT points. By default, we use n_fft=400 because it corresponds to the standard window length of 25 ms. If fs=16000, we have to work on a grid of points spaced fs/n_fft = 40 Hz apart, which corresponds to the following frequencies (from 0 to 8 kHz):
tensor([0., 40., 80., 120., 160., 200., ..., 7920., 7960., 8000.])
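As a quick check of that grid, here is a minimal sketch in plain Python (no torch needed). The variable names are mine; only fs=16000 and n_fft=400 come from the explanation above.

```python
sample_rate = 16000
n_fft = 400

# Spacing between neighboring FFT bins, in Hz.
bin_spacing_hz = sample_rate / n_fft

# A real-input FFT keeps n_fft // 2 + 1 bins, from DC up to Nyquist.
n_bins = n_fft // 2 + 1
freq_grid_hz = [k * bin_spacing_hz for k in range(n_bins)]

print(bin_spacing_hz, n_bins)
print(freq_grid_hz[:6], freq_grid_hz[-3:])
```

With these settings there are 201 grid points, and any filter can only be evaluated at multiples of 40 Hz.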

Let’s assume now that we did all the computations and we have to place the maximum of the triangular filter at f_central = 89.9678 Hz.
When you compute the filter values at the discrete points of the frequency grid, you get:

0.0000, 0.4482, 0.9064, 0.6554, 0.2072, 0.0000, 0.0000,

The maximum of “1.0000” lies somewhere between the third and the fourth frequency points, but you cannot see it with the limited resolution of the FFT.
If you increase the FFT resolution (see code below), the maximum sampled value indeed gets closer to 1:

  ...  0.7878, 0.8316, 0.8754, 0.9191, 0.9629, 0.9933, 0.9496, 0.9058, 0.8620,
        0.8182, 0.7745....
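The effect can be reproduced with a small stand-alone sketch. This is not the SpeechBrain implementation: `triangular_filter` and `sampled_peak` are hypothetical helpers, and the 90 Hz band is a made-up value, so the numbers will not match the ones quoted above. Only f_central = 89.9678 Hz and fs = 16000 are taken from the example.

```python
def triangular_filter(freq_hz, f_central_hz, band_hz):
    """Symmetric triangle: 1 at f_central_hz, 0 at a distance of band_hz or more."""
    return max(0.0, 1.0 - abs(freq_hz - f_central_hz) / band_hz)


def sampled_peak(n_fft, sample_rate=16000, f_central_hz=89.9678, band_hz=90.0):
    """Largest filter value seen on the FFT frequency grid for a given n_fft."""
    step_hz = sample_rate / n_fft
    values = [
        triangular_filter(k * step_hz, f_central_hz, band_hz)
        for k in range(n_fft // 2 + 1)
    ]
    return max(values)


# Finer grids place a sample closer to the true peak of the triangle.
for n_fft in (400, 4000, 40000):
    print(n_fft, round(sampled_peak(n_fft), 4))
```

The continuous triangle always peaks at 1; only the sampled version misses the peak, and for these settings it misses by less as n_fft grows.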

One alternative approach could be to force the central frequency of each filter to coincide exactly with one of the frequencies available in the frequency grid (e.g., if the real central frequency is 89.96 Hz, we might put it at 80 Hz). This is not ideal either, because it means that the central frequencies of the filters depend on the FFT resolution. Moreover, the resulting filterbank would deviate from the one expected from the mel scale. Both solutions lead to a non-ideal fbank matrix due to the limited resolution of the FFT.
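To make the drawback of snapping concrete, here is a minimal sketch. `snap_to_grid` is a hypothetical helper, not something in SpeechBrain, and the n_fft values other than 400 are picked only for illustration.

```python
def snap_to_grid(f_central_hz, n_fft, sample_rate=16000):
    """Move a central frequency to the nearest FFT bin frequency."""
    step_hz = sample_rate / n_fft
    return round(f_central_hz / step_hz) * step_hz


# The same mel-derived center lands on a different frequency for each n_fft.
for n_fft in (400, 1024, 4096):
    print(n_fft, snap_to_grid(89.9678, n_fft))
```

With n_fft=400 the center moves to 80 Hz, as in the example above, while other FFT sizes move it elsewhere, so the filterbank would no longer be defined by the mel scale alone.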
