Can we access the latent representations in wav2vec2 for phonemes?

Hi There,

I have been trying to use SpeechBrain for end-to-end ASR. My question is: can we access the latent representations in wav2vec2 for phonemes? If yes, any suggestions on how that would work?

Hi Manisha, you can access anything, but the difficulty of doing it will depend on what you call "latent".

Hello,

So, is it possible to use latent representations for phoneme identification?

Well yes, but my question is more about: what is, for you, the latent representation of w2v2? Is it the last layer, the one before, another one, etc.?

Latent representations, for me, would be the output right after the CNN layers. I would also like to try the output of the last layer.

Okay, then you can have a look at our wav2vec2 recipes. If you look into compute_forward, you'll see that we call the wav2vec2 model; there, you obtain the last layer of the model (the context representations, C). If you want the CNN features, you have to modify the code of the lobes accordingly. If I remember correctly, there is an argument that you can pass to the Fairseq forward call (or a function you can call instead) that will return the features rather than the final output.
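
In case it helps, here is a minimal sketch of both access points using the HuggingFace transformers wrapper rather than Fairseq (the checkpoint name, shapes, and dummy input are assumptions, not from the SpeechBrain recipe itself):

```python
import torch
from transformers import Wav2Vec2Model

# Assumed checkpoint; any wav2vec2 checkpoint should behave the same way.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

wav = torch.randn(1, 16000)  # dummy batch: 1 second of 16 kHz audio

with torch.no_grad():
    # Latent features right after the CNN encoder (z in the wav2vec 2.0 paper).
    z = model.feature_extractor(wav)  # (batch, 512, frames)
    z = z.transpose(1, 2)             # (batch, frames, 512)

    # Full forward pass: last_hidden_state is the final transformer layer (C);
    # hidden_states also contains every intermediate layer.
    out = model(wav, output_hidden_states=True)
    c = out.last_hidden_state    # (batch, frames, 768)
    layers = out.hidden_states   # tuple of (batch, frames, 768) tensors

print(z.shape, c.shape, len(layers))
```

Either z or c can then be fed frame-by-frame to a small classifier head for phoneme identification, which matches what you described above.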