Use pretrained SpeechBrain model as Loss

Hello,

I’m trying to use the embeddings in

import torchaudio
from speechbrain.pretrained import EncoderClassifier
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
signal, fs = torchaudio.load('samples/audio_samples/example1.wav')
embeddings = classifier.encode_batch(signal)

as part of my custom loss function. More precisely, I want to backpropagate through classifier, but I don't know how to make classifier.encode_batch(signal) return a tensor that has a grad_fn. Put differently:
Is it possible to backpropagate through the pretrained models of SpeechBrain?
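
To make it concrete, this is roughly the kind of thing I have in mind (just a sketch; output_wav and target_wav stand in for tensors from my own model, and cosine similarity is only an example distance):

import torch.nn.functional as F

emb_out = classifier.encode_batch(output_wav)    # output of my model, shape (batch, time)
emb_ref = classifier.encode_batch(target_wav)    # reference waveform
# use the distance between speaker embeddings as (part of) the training loss
loss = 1 - F.cosine_similarity(emb_out.squeeze(1), emb_ref.squeeze(1)).mean()
loss.backward()                                  # needs encode_batch to be differentiable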

Best regards,

@Gastron might know more about that.

Load the EncoderClassifier with freeze_params=False, i.e. change the loading line to:
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", freeze_params=False)

If I do that, say

import torchaudio
from speechbrain.pretrained import EncoderClassifier
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", freeze_params=False)
signal, fs = torchaudio.load('samples/audio_samples/example1.wav')
embeddings = classifier.encode_batch(signal)

then I get an error:

Traceback (most recent call last):
  File "/export/home/lay/Programme/professional pycharm/pycharm-2021.2.1/plugins/python/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/export/home/lay/Programme/professional pycharm/pycharm-2021.2.1/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/export/home/lay/PycharmProjects/df-conformer/toyexampleSpeechBrain.py", line 82, in <module>
    embeddings = classifier.encode_batch(signal)
  File "/usr/local/lib/python3.8/dist-packages/speechbrain/pretrained/interfaces.py", line 680, in encode_batch
    embeddings = self.modules.embedding_model(feats, wav_lens)
  File "/export/home/lay/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 476, in forward
    x = self.asp_bn(x)
  File "/export/home/lay/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/speechbrain/nnet/normalization.py", line 91, in forward
    x_n = self.norm(x)
  File "/export/home/lay/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/export/home/lay/.local/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 167, in forward
    return F.batch_norm(
  File "/export/home/lay/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 2279, in batch_norm
    _verify_batch_size(input.size())
  File "/export/home/lay/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 2247, in _verify_batch_size
    raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 6144, 1])

Process finished with exit code 1

This method needs a batch of audio. If you just want to run a single signal through, you can fake a batch by doing: signal.unsqueeze(0), e.g. embeddings = classifier.encode_batch(signal.unsqueeze(0))

After

signal, fs = torchaudio.load('samples/audio_samples/example1.wav')

signal is a torch.Tensor of size (1, 64000) in my case, so it already is a batch of audio with batch size 1. Unsqueezing doesn't help here; when I run your suggestion with unsqueeze(0), I just get another error…

Oh right, my bad. I looked at your error message in more detail, and it looks like the error actually comes from batch norm: I think the problem is that the default batch norm does not work with batch size 1 in training mode (it cannot compute a variance). SpeechBrain actually has a flag to normalize over the time axis as well as the batch axis, which should work. However, it is not possible to set this flag directly when instantiating the model we're talking about. Also, with batch norm it would still be theoretically better to have a proper batch of samples.
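
Just to make the failure mode concrete, it is easy to reproduce with plain PyTorch (a minimal sketch, nothing SpeechBrain-specific):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(6144)      # in training mode by default
bn(torch.randn(2, 6144, 1))    # fine: variance is computed over two samples
# bn(torch.randn(1, 6144, 1))  # raises "Expected more than 1 value per channel when training"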

But if you want to hack it to work with single samples, you could just change this line in the SpeechBrain library (if you've installed it in editable mode):

to:

    combine_batch_time=True,
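
Roughly speaking, that flag folds the time dimension into the batch dimension before normalising, so there is more than one value per channel even with a single sample. A simplified sketch of the idea (not the exact SpeechBrain code):

import torch
import torch.nn as nn

x = torch.randn(1, 200, 64)            # (batch, time, channels) with batch size 1
bn = nn.BatchNorm1d(64)
x_flat = x.reshape(-1, x.shape[-1])    # (batch * time, channels) -> 200 "samples"
y = bn(x_flat).reshape(x.shape)        # statistics over batch and time combined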

Thanks for the answer, but it still doesn't work. Just to clarify: I don't want to train the ecapa-tdnn model. I just want to know the gradients

d ecapa-tdnn(x) / dx_i

In order to do so, I need to backpropagate through ecapa-tdnn.
When I change to

combine_batch_time=True,

then I just get another error:

Traceback (most recent call last):
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 465, in forward
    x = layer(x, lengths=lengths)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'lengths'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/home/lay/Programme/professional pycharm/pycharm-2021.2.1/plugins/python/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/export/home/lay/Programme/professional pycharm/pycharm-2021.2.1/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/toy.py", line 12, in <module>
    embeddings = classifier.encode_batch(signal)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/speechbrain/pretrained/interfaces.py", line 680, in encode_batch
    embeddings = self.modules.embedding_model(feats, wav_lens)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 467, in forward
    x = layer(x)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/speechbrain/lobes/models/ECAPA_TDNN.py", line 72, in forward
    return self.norm(self.activation(self.conv(x)))
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/speechbrain/nnet/normalization.py", line 91, in forward
    x_n = self.norm(x)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 168, in forward
    return F.batch_norm(
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/torch/nn/functional.py", line 2282, in batch_norm
    return torch.batch_norm(
RuntimeError: running_mean should contain 327 elements not 1024
python-BaseException

Process finished with exit code 1

I also tried to load the model as a pretrained model, as described here (Google Colab).
However, I just get yet another error when I simply try to do a forward pass:

import torchaudio
from einops import rearrange
from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN
from speechbrain.utils.parameter_transfer import Pretrainer

model = ECAPA_TDNN(input_size=80,
                   channels=[1024, 1024, 1024, 1024, 3072],
                   kernel_sizes=[5, 3, 3, 3, 1],
                   dilations=[1, 2, 3, 4, 1],
                   attention_channels=128,
                   lin_neurons=192)

# Initialization of the pre-trainer
pretrain = Pretrainer(loadables={'model': model}, paths={'model': 'speechbrain/spkrec-ecapa-voxceleb/embedding_model.ckpt'})

# We download the pretrained model from HuggingFace in this case
pretrain.collect_files()
pretrain.load_collected(device='cpu')

signal, fs = torchaudio.load('samples/audio_samples/example1.wav')
signal = rearrange(signal, 'b (n c) -> b n c', c=1)  # signal now of shape (1, 64000, 1) = (batch, time, channel), as suggested in the docstring of ECAPA_TDNN

model(signal)  # raises an error

The error here is:

  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 301, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/export/home/lay/PycharmProjects/spkrec-ecapa-voxceleb/sp_venv/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 297, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [1024, 80, 5], expected input[1, 1, 52177] to have 80 channels, but got 1 channels instead

Could it be that I have to transform my signal first with an STFT or something similar to get a time-frequency representation with 80 channels?
It would be nice if we could find a working solution. Thanks a lot.
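
If so, something like this is what I imagine (not verified on my side; Fbank with n_mels=80 is my guess based on input_size=80, and I'm ignoring any mean/variance normalisation that the full pipeline probably applies on top):

from speechbrain.lobes.features import Fbank

compute_fbank = Fbank(n_mels=80)   # 80-dim log-Mel features, to match input_size=80
feats = compute_fbank(signal)      # signal: raw (batch, time) waveform, without the rearrange above
embeddings = model(feats)          # feats: (batch, frames, 80)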

Ohhkay. Third time’s the charm maybe? This time I reproduced the issue myself and played around with the code. I realised that there’s a with torch.no_grad(): in the feature extraction lobes - that will stop the gradients from being computed.

The feature extraction lobes have potentially learnable parameters, but those are controlled by a separate flag, so the no_grad is not really needed there. I created a PR here - you can try that; it should fix this issue. I think we will take a bit of care before merging the PR, because it affects (in a very minor way, but still) almost all recipes.
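
With that change, checking gradients with respect to the input should then just be a matter of something like this (a sketch; the .sum() is only a placeholder for a real loss, and the repeat is there because of the batch-norm issue discussed above):

signal = signal.repeat(2, 1)                   # batch of two, to keep batch norm happy
signal.requires_grad_(True)
embeddings = classifier.encode_batch(signal)
embeddings.sum().backward()                    # stand-in for your actual loss
print(signal.grad.shape)                       # gradients w.r.t. the input waveform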

Okay, thanks a lot! It works now. By the way, you can also work around it by using:

import torchaudio
from speechbrain.pretrained import EncoderClassifier

signal1, fs = torchaudio.load('exm1.wav')
signal2, fs = torchaudio.load('exm2.wav')
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", freeze_params=False)

# repeat each signal so the batch size is 2 and batch norm gets a proper batch
signal1 = signal1.repeat(2, 1)
signal2 = signal2.repeat(2, 1)

signal1.requires_grad_(True)
signal2.requires_grad_(True)
embeddings1 = classifier.encode_batch(signal1)
embeddings2 = classifier.encode_batch(signal2)

e1 = embeddings1[0]
e2 = embeddings2[0]

This works in the sense that e1 and e2 have gradients. But there is another big issue (not related to this topic): two very different waveforms are embedded onto the same feature vector, i.e. e1 == e2 for two different signals. How can that be?! I will open a new thread for this.