Getting time alignment

Hey! Thank you for the great toolkit, the functionality looks awesome and I may end up using it more and more :>

I have been wondering if it is possible to get some kind of word-, phone-, or at least grapheme-level alignments using a pretrained model. I am basically trying to use the ASR as an utterance segmentation tool. In Kaldi I was able to do this, but it is quite troublesome to set up the Kaldi environment where I work now, so I was thinking I could use a more recent toolkit.

The most straightforward approach would be to get a CTC output of the same length as the input feature sequence, where each index holds an int mappable to a grapheme that naively corresponds to the acoustic event in that frame. I am trying to do that, but the decoder code looks quite complex, and I am not even sure whether that is possible with the CTC/attention architecture, which I haven't used before.
I would be glad if I could get some guidance here, thank you!

I wonder if @lugosch has already tried this?

Hey Elendium, SpeechBrain has a basic traditional HMM aligner you can try: speechbrain/aligner.py at develop · speechbrain/speechbrain · GitHub

But we’ve only gotten it working for TIMIT, and it may take some work to adapt it to your dataset.

Forced alignment with CTC or attention using whatever labels you have (words, wordpieces, phones) is actually not that hard to implement. Unfortunately, it’s not implemented in SpeechBrain yet. I don’t have time to do it, but maybe someone else would like to. I’ll just sketch out how it works:

For CTC: run the CTC forward algorithm on the encoder outputs, but replace logsumexp() with max(), and for each timestep record whether the argmax came from state s, s-1, or s-2. Then backtrack over the recorded argmaxes, which gives you the path with the highest probability.

If you wanted to try this, I have the CTC forward algorithm implemented in PyTorch in this notebook; you would just have to add the max/argmax part:

https://github.com/lorenlugosch/graves-transducers/blob/master/ctc.ipynb
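
Roughly, the Viterbi version could look something like this. This is just an untested sketch rather than SpeechBrain code: it assumes blank index 0, a per-utterance (T, V) tensor of log-softmax encoder outputs, a 1-D tensor of target labels, and the function name is only for illustration:

```python
import torch

def ctc_viterbi_align(log_probs, targets, blank=0):
    """Frame-level CTC alignment via Viterbi (the forward algorithm with
    logsumexp() replaced by max(), plus backpointers).

    log_probs: (T, V) log-softmax outputs for one utterance
    targets:   (L,)   tensor of label indices, without blanks
    returns:   list of length T with the aligned label (or blank) per frame
    """
    T, V = log_probs.shape
    # Extended label sequence with blanks: [blank, y1, blank, y2, ..., yL, blank]
    ext = [blank]
    for y in targets.tolist():
        ext += [int(y), blank]
    S = len(ext)

    alpha = torch.full((T, S), float("-inf"))
    backptr = torch.zeros((T, S), dtype=torch.long)

    # Initialization: the path can start in the first blank or the first label
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: s, s-1, and s-2 (the skip from s-2 is only
            # allowed when ext[s] is not blank and differs from ext[s-2])
            cands = [alpha[t - 1, s]]
            if s >= 1:
                cands.append(alpha[t - 1, s - 1])
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            cands = torch.stack(cands)
            best = torch.argmax(cands)
            alpha[t, s] = cands[best] + log_probs[t, ext[s]]
            backptr[t, s] = best  # how many states we stepped back (0, 1, or 2)

    # The best path must end in the last blank or the last label
    s = S - 1
    if S > 1 and alpha[T - 1, S - 2] > alpha[T - 1, S - 1]:
        s = S - 2
    path = [ext[s]]
    for t in range(T - 1, 0, -1):
        s -= backptr[t, s].item()
        path.append(ext[s])
    path.reverse()
    return path
```

The returned list gives you, for each encoder frame, either a blank or a label index; runs of the same non-blank label tell you roughly where that label sits in time.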

For attention: run the model with teacher forcing (i.e. feed the correct transcript into the decoder, as is done in training), and for each label, find the timestep with the highest attention weight. Our attention modules return the attention weights (attn) along with the attention output: speechbrain/attention.py at develop · speechbrain/speechbrain · GitHub
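
Once you have the attention weights from a teacher-forced pass, picking the peak per label is basically a one-liner. Here is a small sketch, assuming you have stacked the per-step attn tensors into a (num_labels, num_encoder_frames) matrix and that you know the effective frame shift of the encoder outputs (usually larger than 10 ms because of subsampling; the 0.04 default here is just a placeholder):

```python
import torch

def attention_peaks(attn_weights, frame_shift_s=0.04):
    """attn_weights: (num_labels, num_encoder_frames) attention weights
    collected while running the decoder with teacher forcing.
    Returns, for each output label, the encoder frame with the highest
    attention weight and a rough timestamp (assuming a fixed frame shift)."""
    peak_frames = torch.argmax(attn_weights, dim=-1)    # (num_labels,)
    peak_times = peak_frames.float() * frame_shift_s    # seconds, approximate
    return peak_frames, peak_times
```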

If you would like a very easy-to-use forced alignment tool, I’d recommend the Montreal Forced Aligner.

The CTC forward might also be slightly easier to grasp compared to attention :wink: The Montreal Forced Aligner works pretty well too.

Thanks guys for the replies :smiley: Especially @lugosch, thank you for sketching out the implementation and providing the notebook, it will be very helpful.

One thing, however, is that what I am trying to do is not forced alignment per se (because I do not have the ground-truth transcription); rather, it is extracting alignments from the predicted transcription. The rationale is that ASR models are good at finding, for example, pauses between words, even if the words themselves are not predicted correctly. This should be especially true for CTC-based models, as they are not constrained by a language model. In other words, I believe it is quite likely that a change in the CTC output from one token to another indicates a change in the underlying acoustic signal, even if the predicted token is wrong. I want to use these segments in a downstream algorithm.

You can try force-aligning the ASR transcript instead of the true transcript. Or, even simpler, you could just do greedy decoding (pick the argmax label at each timestep) and see where the CTC output peaks happen.
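
For the greedy version, a minimal sketch could look like this (again just an illustration, assuming per-utterance log-softmax outputs of shape (T, V) and blank index 0; keep in mind that CTC tends to emit each non-blank label as a short spike surrounded by blanks, so the segment boundaries are only approximate):

```python
import torch

def greedy_ctc_segments(log_probs, blank=0):
    """Greedy decoding: take the argmax label at each frame, then merge
    consecutive frames with the same label into (label, start_frame, end_frame)
    segments, dropping the blank runs.

    log_probs: (T, V) per-frame log-softmax outputs for one utterance."""
    frame_labels = torch.argmax(log_probs, dim=-1).tolist()  # length T
    segments = []
    start = 0
    for t in range(1, len(frame_labels) + 1):
        # Close the current run when the label changes or we hit the end
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            if frame_labels[start] != blank:
                segments.append((frame_labels[start], start, t - 1))
            start = t
    return segments
```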

I should also note that if you use RNNs or self-attention in the encoder, there’s no guarantee the alignments from a CTC or attention model will be any good because the model can delay predictions (see this paper by SpeechBrain colleague Peter Plantinga: https://arxiv.org/pdf/2003.01765.pdf). If it doesn’t look like the alignments are good, you can try training a CNN encoder instead, which can’t delay predictions like RNNs/self-attention can.