I’m doing speaker conversion and looking for a metric for the quality of my generated samples. I’ve seen that some papers, both in speaker conversion and in ASR, report MCD (Mel Cepstral Distortion). MCD compares two sound sequences frame by frame, so the sequences must first be time-aligned; this is usually done with DTW (Dynamic Time Warping) as a preceding step.
However, I couldn’t find a good implementation anywhere. There’s https://github.com/MattShannon/mcd, but it does not work on raw waveforms. Is there anything in SpeechBrain, or another recommended source?
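
For concreteness, here is a minimal sketch of the DTW-then-MCD computation I have in mind, using only NumPy. It is not taken from any of the libraries mentioned. The function names `dtw_path` and `mcd` are hypothetical, and it assumes the mel-cepstral coefficients have already been extracted from each waveform into an array of shape (frames, coefficients), with coefficient 0 being the energy term that is conventionally excluded. The extraction step itself is not shown.

```python
import numpy as np


def dtw_path(x, y):
    """Plain DTW between two feature sequences of shape (frames, dims).

    Returns the alignment as a list of (i, j) frame-index pairs.
    """
    n, m = len(x), len(y)
    # Euclidean distance between every frame of x and every frame of y.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)

    # Accumulated cost with an infinite border so the path starts at (0, 0).
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the last cell to the first along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def mcd(ref_mcep, syn_mcep):
    """Mean Mel Cepstral Distortion in dB after DTW alignment.

    Assumes inputs of shape (frames, coefficients); column 0 (energy) is dropped.
    """
    ref = np.asarray(ref_mcep, dtype=float)[:, 1:]
    syn = np.asarray(syn_mcep, dtype=float)[:, 1:]

    path = dtw_path(ref, syn)
    ref_idx = [i for i, _ in path]
    syn_idx = [j for _, j in path]
    diff = ref[ref_idx] - syn[syn_idx]

    # Per-frame MCD: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), then averaged.
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return float(const * np.mean(np.sqrt(np.sum(diff ** 2, axis=1))))
```

With this sketch, `mcd(a, b)` takes two cepstral arrays that may have different numbers of frames. Results depend heavily on how the cepstra were extracted (frame shift, order, warping), so numbers are only comparable when both sides use the same settings.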