Objective speech quality metric (DTW-MCD?)

Hi all,

I’m working on speaker conversion and looking for an objective metric for the quality of my generated samples. I’ve seen that some papers, both in speaker conversion and in ASR, report MCD (Mel Cepstral Distortion). Since MCD compares two sound sequences frame by frame, the sequences have to be time-aligned first, so it is usually preceded by a DTW (Dynamic Time Warping) alignment step.
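To make the question concrete, here is a rough sketch of what I understand DTW-MCD to be, so please correct me if it's off. It assumes the mel-cepstral coefficients have already been extracted into arrays of shape (frames, dims), with the 0th (energy) coefficient dropped; getting from raw waveforms to those features is not shown. The function names are my own and not from any library, and the DTW is a plain NumPy version with the basic step pattern.

```python
import numpy as np

# Standard MCD scaling factor: (10 / ln 10) * sqrt(2), giving a value in dB.
MCD_CONST = 10.0 / np.log(10.0) * np.sqrt(2.0)


def frame_distances(ref, syn):
    """Euclidean distance between every ref frame and every syn frame."""
    diff = ref[:, None, :] - syn[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def dtw_path(cost):
    """Basic DTW (steps: diagonal, up, left). Returns aligned (i, j) pairs."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the last cell to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    return path[::-1]


def dtw_mcd(ref, syn):
    """Mean MCD (dB) over the DTW-aligned frame pairs.

    Assumption: both inputs are (frames, dims) mel-cepstral arrays
    with the 0th coefficient already removed.
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    cost = frame_distances(ref, syn)
    path = dtw_path(cost)
    aligned = np.array([cost[i, j] for i, j in path])
    return MCD_CONST * aligned.mean()
```

Here the average is taken over the length of the warping path. As far as I can tell, papers differ on this (some average over the reference frames instead), so numbers from this sketch may not be directly comparable with published ones.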

However, I couldn’t find a good implementation anywhere. There’s https://github.com/MattShannon/mcd, but it does not work on raw waveforms. Is there anything in SpeechBrain, or another source you would recommend?


Hi! I think we currently have nothing for that. @mravanelli, do you know if the guys from the TTS team have implemented a metric of this kind?