Implementing streaming ASR

I am trying to implement a model for streaming ASR and have explored the topic a bit already. In a non-streaming end-to-end ASR model, we generally feed the audio and its corresponding text. The problem I am facing is how to feed the data for a streaming model. I think I will have to divide the audio into chunks and pass these chunks to the model, but which portion of the text should I pass during training? Can you help me by sharing your ideas or any resources?
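To make concrete what I mean by chunking, here is a rough sketch (the file name and chunk length are arbitrary; the transcript-alignment part is exactly what I don't know how to do):

```python
import torchaudio

# Load a full utterance and cut the waveform into fixed-length chunks
# (assuming a mono file; "utterance.wav" is just a placeholder).
signal, sr = torchaudio.load("utterance.wav")   # (channels, samples)
chunk_len = int(0.5 * sr)                       # e.g. 500 ms per chunk, chosen arbitrarily
chunks = signal.split(chunk_len, dim=1)         # tuple of (channels, <=chunk_len) tensors

# Each chunk could be fed to the model in order, but it is unclear
# which portion of the transcript should be paired with each chunk.
```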

Hi, it depends on what you want to do. For instance, you do not need to train your model in a “streaming” fashion: a standard training procedure can be used to train a model whose architecture is adapted to streaming ASR. (By the way, we would be very interested if you end up with something you can share :-))
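As a minimal sketch of what that can look like (plain PyTorch, not an actual SpeechBrain recipe; the layer sizes and chunk size are arbitrary): the encoder self-attention is restricted by a chunk-wise causal mask, so training runs on full utterances with the full transcript as target, while at inference the model can process audio chunk by chunk without needing future context.

```python
import torch
import torch.nn as nn

def chunk_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Boolean mask where True marks positions that must NOT be attended to.
    Frame t may attend to every frame up to the end of its own chunk."""
    idx = torch.arange(seq_len)
    chunk_end = (idx // chunk_size + 1) * chunk_size - 1   # last visible frame for each position
    return idx.unsqueeze(0) > chunk_end.unsqueeze(1)       # (seq_len, seq_len)

encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

feats = torch.randn(8, 200, 256)                   # (batch, frames, features) from full utterances
mask = chunk_causal_mask(seq_len=200, chunk_size=16)
out = encoder(feats, mask=mask)                    # standard forward/backward, full transcript as target
```

The point is that the “streaming” constraint lives in the architecture (the limited right context enforced by the mask), not in how the data is fed during training.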

I’m also interested in streaming. I’m testing speechbrain/sepformer-whamr and it seems to work well, but it can’t handle files longer than 30 s.
I’m using VAD to split up the file and then running the segments through speechbrain/sepformer-whamr, but I’m worried about the model’s consistency in separating speakers across the segments produced by the VAD split.
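Roughly, my pipeline looks like this (the VAD model name and save directories are just what I happen to use, and I’m glossing over resampling, since the VAD and the separator may expect different sample rates):

```python
import torchaudio
from speechbrain.pretrained import VAD, SepformerSeparation

vad = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty",
                       savedir="pretrained_models/vad-crdnn-libriparty")
separator = SepformerSeparation.from_hparams(source="speechbrain/sepformer-whamr",
                                             savedir="pretrained_models/sepformer-whamr")

audio_file = "long_recording.wav"                  # placeholder path
boundaries = vad.get_speech_segments(audio_file)   # (N, 2) start/end times in seconds
signal, sr = torchaudio.load(audio_file)           # (channels, samples), assuming mono

for start, end in boundaries.tolist():
    segment = signal[:, int(start * sr):int(end * sr)]
    est_sources = separator.separate_batch(segment)   # (batch, time, n_sources)
    # The speakers may come out in a different order from one segment to the
    # next, which is exactly the consistency issue I'm worried about.
```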

Thank you for sharing your work. I’ll look into speechbrain/sepformer-whamr; I hope it will help me understand the training procedure for streaming ASR.

Thank you for your thoughts. I was thinking that we need to train the model in a streaming fashion, i.e. dividing the data into chunks and then feeding them in. From your comment, I understand that we actually need a model with a streaming-capable architecture, but that the training procedure stays the same. Did I get that right?

Additionally, can you suggest any recipe from SpeechBrain for this?