Xvector pretrained model strange results


I have a weird problem when using the pretrained spkrec-xvect-voxceleb model from HuggingFace (https://huggingface.co/speechbrain/spkrec-xvect-voxceleb). When I perform speaker verification by computing cosine similarity on embeddings (via the EncoderClassifier interface), the similarity scores are almost always very close to 1, even when I compare a female speaker against a male speaker, or speech against non-speech sounds. I don't encounter that problem with the spkrec-ecapa-voxceleb model, and its predictions are correct (the only differences in the code are the model source and the savedir path). I've tested both models on the same 2-minute file (segmented into smaller chunks first), and they produce a similar pattern (similarity score vs. time), but the scale is way off for the x-vector model.
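For reference, this is the scoring step I mean, as a minimal sketch. The dummy random vectors stand in for the output of `EncoderClassifier.encode_batch()` (I'm not including the model loading here), and the dimensionality is just illustrative:

```python
import numpy as np

def cosine_score(emb1, emb2):
    """Cosine similarity between two embedding vectors."""
    emb1, emb2 = np.ravel(emb1), np.ravel(emb2)
    return float(emb1 @ emb2 / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

# Dummy 512-dim embeddings in place of real x-vectors / ECAPA embeddings
rng = np.random.default_rng(0)
a = rng.standard_normal(512)
b = rng.standard_normal(512)

print(cosine_score(a, a))  # identical embeddings -> exactly 1.0
print(cosine_score(a, b))  # independent random vectors -> near 0
```

With ECAPA embeddings I get scores that behave like this (same speaker high, different speaker low); with the x-vector model everything lands near 1.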

I tried with and without embedding normalization, and it didn't change much.

Any ideas on what is wrong with the model (or what I'm doing wrong)?



Hi, same problem here. Did you work it out? If you got the model working well, can you share the solution?

Hi. I'm pretty sure the x-vector model is meant to be scored with PLDA rather than plain cosine similarity. I ended up using only ECAPA-TDNN.
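One way to see why cosine can saturate like this: if every embedding shares a large common component (a strongly non-zero mean direction), the cosine between any two embeddings is dominated by that component and comes out near 1 regardless of speaker. PLDA-style backends subtract a global mean as a first step, which removes exactly that effect. A toy numeric illustration (random vectors, not actual x-vectors):

```python
import numpy as np

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(1)
offset = 20.0 * np.ones(512)                     # large component shared by all embeddings
embs = offset + rng.standard_normal((100, 512))  # toy "embeddings" differing only by noise

raw = cos(embs[0], embs[1])                      # dominated by the shared offset -> near 1
mean = embs.mean(axis=0)                         # global mean over the toy population
centered = cos(embs[0] - mean, embs[1] - mean)   # offset removed -> near 0
print(raw, centered)
```

So if the pretrained x-vector embeddings have a large shared mean, raw cosine scores would all cluster near 1, consistent with what was reported above.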

Hi. I have the same problem. The same thing happens when I train my own x-vector model (the extracted embeddings are all very close to each other). Could it be that the x-vector embedding extractor is buggy? Did you find a solution to the problem?