SpeechBrain for Sign Language Recognition?

Hi everyone, I’m working on my undergraduate thesis on sign language recognition. My thesis title is “Recognition of signing sequences with hybrid architectures”. My professor proposed that we use a DNN-HMM in Kaldi. However, despite Kaldi’s state-of-the-art performance, I didn’t find it very user-friendly. So I did my research and came across SpeechBrain.

Since speech recognition and sign language recognition are both sequence-learning problems (Kaldi reads a sequence of acoustic features for speech, just as it would read a sequence of visual features for vision), I was wondering if I could use SpeechBrain for this task instead.

This is a great question.

First, you have to check with your advisor whether the use of HMMs is mandatory. SpeechBrain does not support them yet, even though support is planned.

If it is not, then it becomes interesting: SpeechBrain is designed to deal with sequences. As long as you have a time dimension, you can in theory process any kind of “feature” dimension, so it could be either a speech signal or images. However, I don’t think that the architectures we propose are well adapted to image processing. You would have to create an encoder that is able to encode a sequence of images before applying our decoder on top of it. But this sounds cool.
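As a rough illustration of what such an encoder could look like, here is a minimal PyTorch sketch. The class name `FrameSequenceEncoder`, the layer sizes, and the (batch, time, channels, height, width) input convention are my own assumptions for this example, not SpeechBrain API: a small per-frame CNN followed by a bidirectional GRU over time, producing the (batch, time, feature) tensors that sequence decoders typically consume.

```python
import torch
import torch.nn as nn

class FrameSequenceEncoder(nn.Module):
    """Hypothetical encoder: per-frame CNN + BiGRU over time.

    Input:  (batch, time, channels, height, width) video frames.
    Output: (batch, time, feature) sequence, the shape that
            sequence decoders typically expect.
    """

    def __init__(self, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse H and W into one 64-d vector per frame
        )
        self.rnn = nn.GRU(64, feat_dim, batch_first=True, bidirectional=True)

    def forward(self, frames):
        b, t, c, h, w = frames.shape
        # Run the CNN on every frame independently, then restore the time axis.
        feats = self.cnn(frames.view(b * t, c, h, w)).view(b, t, -1)
        out, _ = self.rnn(feats)  # (batch, time, 2 * feat_dim)
        return out

# Quick shape check with a fake batch: 2 clips of 50 RGB frames at 64x64.
enc = FrameSequenceEncoder()
print(enc(torch.randn(2, 50, 3, 64, 64)).shape)  # torch.Size([2, 50, 512])
```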


@titouan.parcollet Thank you for your answer, I really appreciate it.

I have some questions regarding your answer.

  1. Do you know when SpeechBrain is going to support HMMs?
  2. As I’m a beginner in the field of speech/visual recognition, could you please describe some requirements the encoder will need to meet before applying SpeechBrain’s decoder on top of it?
  3. And finally, after applying the encoder, can you suggest any tutorial for the decoding part?

Thank you in advance.

Hey:

  1. It mostly depends on K2 and the resources we have. So I do not know …
  2. I think it would be better if you simply familiarized yourself with the literature. I can’t really tell you everything here :confused:
  3. This is related to point two. You can simply read papers about CTC/attention decoding or transformers for speech, and you will quickly understand how it works. Then you can look at the comments and docstrings in the code to understand how it is implemented.
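To make the CTC side a bit more concrete, here is a minimal, self-contained PyTorch sketch of CTC training on top of an encoder output like the one above. All shapes, sizes, and the blank index here are my own assumptions for illustration:

```python
import torch
import torch.nn as nn

# Fake encoder output: 2 clips, 50 time steps, 512 features
# (matching the FrameSequenceEncoder sketch above).
enc_out = torch.randn(2, 50, 512)

vocab_size = 30                    # hypothetical label set, with blank at index 0
proj = nn.Linear(512, vocab_size)  # maps features to per-frame label scores
# nn.CTCLoss expects log-probabilities shaped (time, batch, vocab).
log_probs = proj(enc_out).log_softmax(-1).transpose(0, 1)

targets = torch.randint(1, vocab_size, (2, 10))  # fake label sequences (no blanks)
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow into the projection (and the encoder, in practice)
print(loss.item())
```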