Individual diarisation pipeline steps?

Hi, congratulations to the team for completing such an ambitious project! The timing is great too, given the sudden rise in live audio and podcasting.

I have skimmed through the docs and recipes, and I wonder how you are looking to tackle tasks such as VAD, speaker change detection, and overlapping speaker detection. My understanding is that, as far as diarisation goes, you currently assume the input is a dataset of speech with non-overlapping, pre-labelled speakers. I am particularly interested in solving speaker change detection. Is there any work currently in that direction?

Thank you



Thank you for trying out SpeechBrain.

A VAD module will soon be added to SpeechBrain. Currently, the VAD segments are taken from the ground truth.
Yes, in general, a speaker diarization pipeline has many modules and can become complicated. We follow a simple pipeline: the current recipe divides the audio into uniform chunks, as is also done in most papers. Change point detection and overlap detection are not yet in SpeechBrain and would be nice future additions.
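For anyone wondering what the uniform-chunking step looks like in practice, here is a minimal sketch (not SpeechBrain's actual API; the function name and default chunk/shift lengths are just illustrative). Each chunk would later get one speaker embedding:

```python
def uniform_chunks(signal, sample_rate, chunk_s=1.5, shift_s=0.75):
    """Split a mono signal into fixed-length, overlapping chunks.

    Returns (start_sample, end_sample) pairs; each chunk would then be
    fed to an embedding extractor in a diarization recipe.
    """
    chunk = int(chunk_s * sample_rate)
    shift = int(shift_s * sample_rate)
    spans = []
    start = 0
    while start + chunk <= len(signal):
        spans.append((start, start + chunk))
        start += shift
    return spans

# 10 s of placeholder "audio" at 16 kHz
spans = uniform_chunks([0.0] * 160000, 16000)
print(len(spans), spans[0], spans[1])  # 12 (0, 24000) (12000, 36000)
```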

Hi nauman,

I’ll try to build a model for change point detection over the next couple of weeks, if it works, I will post a PR.
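One common baseline for change point detection, before training a dedicated model, is to compare embeddings of adjacent chunks and flag boundaries where the distance spikes. A toy sketch, assuming per-chunk embeddings are already available (all names and the threshold are illustrative, not from any SpeechBrain recipe):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def change_points(embeddings, threshold=0.5):
    """Return chunk indices where the speaker likely changed."""
    return [i for i in range(1, len(embeddings))
            if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold]

# toy embeddings: two chunks of speaker A, then two of speaker B
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(change_points(embs))  # [2]
```

A learned model would replace the fixed threshold, but this kind of distance curve is a useful sanity check.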

Diego, Podz

Hi Nauman,
Thanks for your work. Any further development on the VAD module? We're trying not to go back to Kaldi.
Alternatives are also appreciated; we're working on 3-second passphrase speaker verification.

Some students working with @mravanelli are working on VAD. It may be added to SpeechBrain soon.

Any updates or ETA on when a proper diarization pipeline will be available? One that has VAD, speaker change detection, and an embedding extractor, so one could take a pre-trained model, fine-tune it on one's own dataset, and apply it to an evaluation set.

Still working on it; a PR on VAD is in progress. Note: you will never have a ready-to-use recipe that works for every use case. None of these blocks (VAD, diarization, embedding extraction) work reliably in the wild yet, and all are still under active research, so a system combining all of them is likely to fail in a production environment. This is why releasing such a pipeline is not really a priority for us. SpeechBrain can support many of these pipelines, and they may change significantly over time. It's better to provide blocks and let users do what they want with them. But we will try …
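To make the "blocks, not a monolithic recipe" point concrete, here is a rough skeleton of how the stages could be wired together by a user. The block interfaces are hypothetical (plain callables, not SpeechBrain classes); the point is that each stage is independently swappable:

```python
def diarize(signal, sr, vad, embed, cluster):
    """Compose separate blocks into a simple diarization pipeline.

    vad     : signal, sr -> list of (start, end) speech regions
    embed   : segment -> embedding vector
    cluster : list of embeddings -> one speaker label per segment
    """
    speech = vad(signal, sr)
    embeddings = [embed(signal[s:e]) for s, e in speech]
    labels = cluster(embeddings)
    return list(zip(speech, labels))

# toy stand-in blocks, just to show the wiring
vad = lambda sig, sr: [(0, 4), (4, 8)]
embed = lambda seg: [sum(seg) / len(seg)]
cluster = lambda embs: [0 if e[0] < 0.5 else 1 for e in embs]

signal = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(diarize(signal, 8, vad, embed, cluster))
# [((0, 4), 0), ((4, 8), 1)]
```

In a real system each lambda would be replaced by a trained model, which is exactly why a single frozen recipe is hard to ship.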

Yes, I mean blocks only, such as VAD, SCD, etc., to have a baseline. In my case it's research-oriented, not production, so I'm not fussed. Doing research to improve such components in the wild is the whole point, and I look forward to being able to do it directly in PyTorch/Python instead of Kaldi, which mixes several languages.

@nauman, where are we with the diarization interface?