I have a weird problem when using the pretrained spkrec-xvect-voxceleb model from HuggingFace (https://huggingface.co/speechbrain/spkrec-xvect-voxceleb). When I try to perform speaker verification by calculating cosine similarity on embeddings (via the EncoderClassifier interface), the similarity scores are almost always very close to 1, even when I compare files of female and male speakers, or speech against non-speech sounds. I don’t encounter this problem with the spkrec-ecapa-voxceleb model, whose predictions are correct (the only difference in the code is the model source and savedir path). I’ve tested the two models on the same 2-minute file (segmented into smaller chunks first), and I can see that they produce a similar pattern (similarity score vs. time), but the scale is way off for the xvector model.
I tried with and without normalization, and it didn’t change much.
Any ideas on what is wrong with the model (or what I am doing wrong)?
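
To make the symptom concrete, here is a minimal pure-Python sketch of the cosine comparison described above. The vectors are made-up toy values, not real outputs of either model, and the helper names (cosine_similarity, center) are only illustrative. It shows one way scores can all sit near 1 while still following the same pattern: a large component shared by every embedding dominates the cosine, and subtracting the mean embedding restores the spread. Whether this is what actually happens with the xvector model is only a guess.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def center(vectors):
    """Subtract the per-dimension mean of the set from every vector."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return [[x - m for x, m in zip(v, mean)] for v in vectors]


# Toy stand-ins for embeddings (hypothetical values): a large offset shared
# by all vectors plus a small speaker-specific part.
offset = [10.0, 10.0, 10.0, 10.0]
speaker_parts = [
    [1.0, 0.0, -1.0, 0.0],   # speaker A, chunk 1
    [0.9, 0.1, -1.0, 0.0],   # speaker A, chunk 2
    [-1.0, 0.0, 1.0, 0.0],   # speaker B
]
embeddings = [[o + s for o, s in zip(offset, p)] for p in speaker_parts]

# Raw scores: both pairs end up close to 1 because the offset dominates.
raw_same = cosine_similarity(embeddings[0], embeddings[1])
raw_diff = cosine_similarity(embeddings[0], embeddings[2])

# After mean-centering, the same-speaker and different-speaker pairs separate.
centered = center(embeddings)
cen_same = cosine_similarity(centered[0], centered[1])
cen_diff = cosine_similarity(centered[0], centered[2])

print(f"raw:      same={raw_same:.4f}  diff={raw_diff:.4f}")
print(f"centered: same={cen_same:.4f}  diff={cen_diff:.4f}")
```

In this toy case the raw scores keep the right ordering (same speaker above different speaker) but are squeezed into a narrow band near 1, which resembles the "similar pattern, wrong scale" behaviour described above.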