Aligning wav2vec2 latent representations with CTC-predicted tokens

I am trying to develop a multitask model on top of the ASR system.
Specifically, I would like to know which wav2vec2 frame representations contributed to each token predicted with CTC. Each token will usually span several latent representations, but those can be pooled into a single “summarized” representation in several ways. This would give me both an embedding of the token (from BERT) and a “signal embedding” that may carry useful information for downstream tasks such as dependency parsing.
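
To make the idea concrete, here is a minimal sketch of what I have in mind. It assumes greedy (argmax) CTC decoding, so there is one predicted token id per wav2vec2 frame, and it uses mean pooling as the summary. The function names, the blank id of 0, and the toy data are placeholders of mine, not part of any library.

```python
from itertools import groupby


def ctc_token_spans(frame_ids, blank_id=0):
    """Map each emitted CTC token to the frames that produced it.

    Consecutive identical frame predictions are merged into one run and
    blank runs are dropped, which mirrors greedy CTC decoding.
    Returns a list of (token_id, start, end) with `end` exclusive.
    """
    spans = []
    pos = 0
    for token_id, run in groupby(frame_ids):
        length = len(list(run))
        if token_id != blank_id:
            spans.append((token_id, pos, pos + length))
        pos += length
    return spans


def mean_pool(frames, start, end):
    """Average the frame vectors in [start, end) into a single vector."""
    dim = len(frames[0])
    n = end - start
    return [sum(frames[t][d] for t in range(start, end)) / n for d in range(dim)]


# Toy example: 8 frames, 2-dimensional "latent" vectors (hypothetical data).
frame_ids = [0, 3, 3, 0, 5, 5, 5, 0]          # per-frame argmax, 0 = blank
frames = [[float(t), 1.0] for t in range(8)]   # stand-in for wav2vec2 outputs

spans = ctc_token_spans(frame_ids)
signal_embeddings = [mean_pool(frames, s, e) for _, s, e in spans]
```

With this scheme the blank frames are not assigned to any token, and CTC outputs tend to be peaky, so a token may end up with only one or two frames. Extending each span over its neighbouring blanks would be one possible workaround, but I have not tested it.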

I have looked into the implementation of the “S2SBeamSearcher” (source) class to check whether there is an easy way to extract this information. I think the attention in the “forward_step()” function may hold the answer, but I am unsure whether it is the best way to do this.
Do you have any suggestions on how to obtain this information?
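
If the attention route is viable, the pooling I imagine would look roughly like the sketch below. It assumes the per-step attention weights can be stacked into a matrix with one row per decoded token and one column per encoder frame; I have not verified that “forward_step()” exposes them in that shape. The function name and the toy values are hypothetical.

```python
def attention_pool(attn, frames):
    """Summarize encoder frames per token using attention weights.

    attn:   num_tokens x num_frames weights (each row must sum to > 0)
    frames: num_frames x dim encoder outputs
    Returns num_tokens x dim attention-weighted averages.
    """
    dim = len(frames[0])
    pooled = []
    for weights in attn:
        total = sum(weights)  # renormalize in case a row does not sum to 1
        pooled.append([
            sum(w * frame[d] for w, frame in zip(weights, frames)) / total
            for d in range(dim)
        ])
    return pooled


# Toy example: 2 tokens attending over 3 frames (hypothetical values).
attn = [[0.5, 0.5, 0.0],
        [0.0, 0.0, 1.0]]
frames = [[0.0, 2.0], [2.0, 2.0], [4.0, 6.0]]
token_signal = attention_pool(attn, frames)
```

Unlike a hard CTC alignment, this gives a soft assignment in which every frame can contribute to every token.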
Thanks a lot for the guidance and for building this framework.