MFCC output shape

I am trying to compute the MFCCs of an audio signal. The output of SpeechBrain's (SB) MFCC has shape (len, 660), whereas a traditional MFCC has shape (len, num_of_mfcc_coefficients).

1.) How should I interpret the output of SB's MFCC?

from speechbrain.lobes.features import MFCC as sb_mfcc
from speechbrain.dataio.dataio import read_audio

sbmfcc = sb_mfcc()  # featurizer with default settings
sb_audio = read_audio(file_path)  # 1-D waveform tensor
# Add a batch dimension for the featurizer, then drop it again
sb_full = sbmfcc(sb_audio.unsqueeze(0)).squeeze(0)
sb_full.shape
> (<len>,660)

2.) Is this output not comparable to the output of torchaudio.transforms.MFCC? Is there a reason SB chose to implement these featurizers itself? Are you trying to make them learnable? It would be great if you could elaborate.

Hi.

The reason for this shape can be found in the documentation of this functionality. By default, a context window is applied: 5 frames are taken from each side of the current frame and stacked with it, so each output vector covers 11 frames rather than one.
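To make the arithmetic concrete, here is a small pure-Python sketch. It assumes the defaults I believe the MFCC lobe documents (20 coefficients, first and second derivatives appended, 5 context frames on each side); please check them against the docs for your version. The helper names mfcc_feature_dim and stack_context are made up for illustration and are not part of SpeechBrain, and the ordering of values inside the stacked vector is not meant to match SpeechBrain's actual layout.

```python
def mfcc_feature_dim(n_mfcc=20, deltas=True, context=True,
                     left_frames=5, right_frames=5):
    """Expected size of the last dimension (hypothetical helper)."""
    # Static coefficients plus first and second derivatives, if enabled
    per_frame = n_mfcc * 3 if deltas else n_mfcc
    # Current frame plus its left and right neighbours, if enabled
    window = left_frames + 1 + right_frames if context else 1
    return per_frame * window


def stack_context(frames, left=5, right=5):
    """Concatenate every frame with its neighbours, zero-padding the edges.

    Illustrative only: shows the idea of a context window on plain lists.
    """
    dim = len(frames[0])
    zero = [0.0] * dim
    stacked_frames = []
    for t in range(len(frames)):
        stacked = []
        for offset in range(-left, right + 1):
            idx = t + offset
            # Neighbours that fall outside the signal are replaced by zeros
            stacked.extend(frames[idx] if 0 <= idx < len(frames) else zero)
        stacked_frames.append(stacked)
    return stacked_frames


print(mfcc_feature_dim())                             # 20 * 3 * 11
print(mfcc_feature_dim(deltas=False, context=False))  # plain coefficients only
```

Under those assumptions the 660 is just 60 per-frame features repeated over 11 frames, and the number of frames (len) is unchanged.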

We decided to go with our own implementation for several reasons: 1) It is user- and research-friendly: one can jump into the code and modify things on the fly as needed; 2) We can indeed learn them :smiley:

Thank you. Could you confirm whether this is the MFCC documentation you mentioned, or is there something more?

Indeed, look at the left_frames and right_frames arguments.