Missing initial transcription in RNN-T

I am trying to develop a streaming ASR system and have gone with the RNN-T model for this. The first thing I did was modify the CRDNN encoder so that 1) I can get the RNN hidden states (to be used during inference) and 2) padding is added in the convolution layers.

The inference is done with the following steps:

  1. Get an audio chunk (about 200-400 ms), queue 5 chunks, extract features, and pass them to the encoder. If it is not the first pass through the encoder, also pass in the hidden and cell states from the previous passes.
  2. Take the encoder output and concatenate it with the previous encoder outputs.
  3. Finally, pass the concatenated output to the beam searcher and get the transcription (a rough sketch of this loop is shown below).
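A rough sketch of one step of this loop, just to make the setup concrete (`compute_features`, `encoder`, and `beam_searcher` are placeholders for my own modules, not actual SpeechBrain APIs):

```python
import torch

def stream_step(wav_block, hidden, enc_history, compute_features, encoder, beam_searcher):
    """One pass of the streaming loop described above (placeholder names).

    wav_block   : queued audio (~5 chunks of 200-400 ms each), shape [1, samples]
    hidden      : RNN hidden/cell states from the previous encoder pass (None at start)
    enc_history : concatenated encoder outputs so far (None at start)
    """
    feats = compute_features(wav_block)            # e.g. fbank features
    enc_out, hidden = encoder(feats, hidden)       # encoder modified to accept/return states

    # Step 2: concatenate with all previous encoder outputs along the time axis
    enc_history = enc_out if enc_history is None else torch.cat([enc_history, enc_out], dim=1)

    # Step 3: re-run the transducer beam search over the full history
    hyps = beam_searcher(enc_history)
    return hyps, hidden, enc_history
```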

This approach works smoothly for the first few seconds (10-12 seconds). After that, it takes visibly longer to get the transcriptions, which is not acceptable for streaming speech recognition. I suspect this is because the concatenated encoder output keeps growing, so each beam-search pass decodes an ever longer sequence.

To address this problem, I am using a Voice Activity Detection (VAD) module: if 3 consecutive chunks contain no speech, I discard the concatenated outputs and the RNN hidden and cell states. Here I face another problem! After discarding, if the very next chunks contain speech (even very clear speech), I get no output for the first few chunks. I have run several experiments here; some are listed below (a minimal sketch of this reset logic follows the list):

  • Resetting only the concatenated outputs while keeping the hidden and cell states. Here I got a repeated transcription of the last word before the reset. An example: আমি ভাত খাই খাই খাই খাই ও বিকালে খেলি (roughly: "I eat rice eat eat eat and play in the afternoon"; sorry for the non-English text, I decided to keep a real example). The expected output is আমি ভাত খাই ও বিকালে খেলি ("I eat rice and play in the afternoon").
  • Resetting both the concatenated outputs and the RNN hidden and cell states. This gives the missing transcription. Example: আমি ভাত খাই খেলি (roughly: "I eat rice play") where the expected output is আমি ভাত খাই ও বিকালে খেলি ("I eat rice and play in the afternoon").
  • Cutting an audio file to remove the leading non-speech part and passing it for transcription. The same problem persists here as well, e.g. getting যোগাযোগের মাধ্যমেগুলোতে (roughly: "on the communication media") where the expected output is সামাজিক যোগাযোগের মাধ্যমগুলোতে ("on the social-communication media").
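For reference, the reset logic I am describing is roughly the following (`is_speech` comes from the VAD; the dict fields are just my own placeholders):

```python
def maybe_reset(is_speech, state):
    """Reset the streaming state after 3 consecutive non-speech chunks.

    `state` is a plain dict: {"silence_count", "enc_history", "hidden"}.
    Which fields get cleared corresponds to the two experiments above.
    """
    if is_speech:
        state["silence_count"] = 0
        return state

    state["silence_count"] += 1
    if state["silence_count"] >= 3:
        # Experiment 1: reset only the concatenated outputs -> repeated last word
        state["enc_history"] = None
        # Experiment 2: also reset the RNN hidden/cell states -> missing words
        state["hidden"] = None
        state["silence_count"] = 0
    return state
```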

I am kind of at a dead end now. Can you share any suggestions or guidelines, especially on:

  • How can I check and confirm where the problem lies: in the model or in the beam search?
  • And, obviously, how can I solve it?
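For the first point, the only check I can think of so far (just a sketch with placeholder names, not from any SpeechBrain recipe) is to compare the streaming encoder outputs against a single offline pass over the same audio, before the beam search is ever involved:

```python
import torch

def compare_streaming_vs_offline(wav, chunk_size, compute_features, encoder):
    """Compare offline encoder outputs with chunked streaming outputs.

    If the two tensors already differ a lot, the problem is on the
    encoder/state side; if they match but the transcriptions differ,
    the beam search is the more likely culprit.
    """
    # Offline pass over the whole utterance
    offline, _ = encoder(compute_features(wav), None)

    # Streaming pass, chunk by chunk, carrying the hidden states
    hidden, outs = None, []
    for start in range(0, wav.shape[1], chunk_size):
        feats = compute_features(wav[:, start:start + chunk_size])
        enc_out, hidden = encoder(feats, hidden)
        outs.append(enc_out)
    streaming = torch.cat(outs, dim=1)

    # Frame counts can differ slightly at the edges because of conv padding
    n = min(offline.shape[1], streaming.shape[1])
    return torch.max(torch.abs(offline[:, :n] - streaming[:, :n]))
```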

Hello @Shahad_Mahmud, I will open a PR today on the Transducer part for:
1- Allowing the Transducer loss to learn a more streamable model, see:
“FastEmit: Low-Latency Streaming ASR with Sequence-Level Emission Regularization”
2- Taking a look at a more interesting decoding algorithm (with a stateless decoder), see:
“Tied & Reduced RNN-T Decoder”
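As a very rough illustration of the stateless-decoder idea from that paper (not the actual PR code): the prediction network only embeds the last couple of emitted labels instead of running an LSTM over the whole label history, something like:

```python
import torch
import torch.nn as nn

class StatelessPredictor(nn.Module):
    """Tied & Reduced style prediction network: no LSTM, just an embedding
    of the last `context` emitted labels (rough illustration only)."""

    def __init__(self, vocab_size, emb_dim, context=2):
        super().__init__()
        self.context = context
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.proj = nn.Linear(context * emb_dim, emb_dim)

    def forward(self, last_labels):
        # last_labels: [batch, context] tensor holding the most recent labels
        emb = self.embed(last_labels)               # [batch, context, emb_dim]
        return self.proj(emb.flatten(start_dim=1))  # [batch, emb_dim]
```

Since the decoder state is then just the last couple of labels, a VAD reset would not have to throw away any recurrent state, which might also help with the issue you describe.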

For the problem you report, I'm analyzing it on my side; let me come back to you soon on that part. In general, if you look at the decoder, I'm extending the prediction within the beam search with a while loop here: speechbrain/transducer.py at 8cb958c65cd79ecfa681d7b61b56f39fd78737d1 · speechbrain/speechbrain · GitHub

We could cap that extension at two or three iterations max… to speed up the decoding.
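Schematically, the per-frame expansion would then look something like this (placeholder names, not the actual code at that link):

```python
BLANK_ID = 0  # assumed blank index

def decode_frame(enc_frame, hyp, decoder_step, joint, max_symbols=3):
    """Expand symbols for one encoder frame, capped at `max_symbols`.

    Instead of a while loop that keeps emitting non-blank labels for the
    same frame, stop after a fixed number of expansions.
    """
    emitted = 0
    while emitted < max_symbols:
        dec_out = decoder_step(hyp)           # prediction network on current hypothesis
        logits = joint(enc_frame, dec_out)    # joint network -> label scores
        label = int(logits.argmax(dim=-1))
        if label == BLANK_ID:                 # blank: move on to the next frame
            break
        hyp.append(label)
        emitted += 1
    return hyp
```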

Let me do some checking on my side.

Thank you for the response. I’m looking at the suggestions you provided. Please let us know if you find something interesting.