The problem is that the pretrained ECAPA-TDNN outputs the same feature vector for two very different signals. Try it out yourself:
```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

signal1, fs = torchaudio.load('Cleanclnsp126_3Wjw0nadnM4_snr15_tl-22_fileid_0_00.wav')
signal2, fs = torchaudio.load('Cleanclnsp129_lJoaywrZPsU_snr0_tl-27_fileid_252_40.wav')

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", freeze_params=False)

# Build a batch of two copies of each signal and track gradients
signal1 = signal1.repeat(2, 1)
signal2 = signal2.repeat(2, 1)
signal1.requires_grad_(True)
signal2.requires_grad_(True)

embeddings1 = classifier.encode_batch(signal1)
embeddings2 = classifier.encode_batch(signal2)
e1 = embeddings1
e2 = embeddings2
```
I can’t upload the .wav files I used here, but I am sure this also applies to your .wav files. Just use two clean speech .wav files from two different speakers (I even used male and female) that are not in the training set of that pretrained net.
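To make "the same feature vector" concrete, you can measure cosine similarity between the two embeddings: a score near 1.0 for different speakers would confirm the collapse. A minimal sketch, using random tensors as stand-ins for the `e1`/`e2` that `encode_batch` returns (replace them with the real outputs to run the actual check):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the 192-dim ECAPA-TDNN embeddings e1, e2
# (swap in classifier.encode_batch(...) outputs to test for real).
torch.manual_seed(0)
e1 = torch.randn(1, 1, 192)
e2 = torch.randn(1, 1, 192)

# Cosine similarity of the flattened embedding vectors:
# near 1.0 means near-identical embeddings; unrelated random
# vectors (or distinct speakers, normally) score much lower.
sim = F.cosine_similarity(e1.flatten(), e2.flatten(), dim=0)
print(sim.item())
```

If both your clean-speech files score close to 1.0 against each other, that reproduces the issue.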