Thank you for this project.
It’s well documented, and it might be the first time I really dig deeper into ML.
We’re using Azure Speech To Text to transcribe video audio tracks.
I think SpeechBrain would allow us to do it in-house.
Is this possible with long videos?
I tried it with the video "How AI can enhance our memory, work and social lives | Tom Gruber" on YouTube, but all that came out of the script was “NOW IF IF I’VE KNOWN”, and that took 220 s.
from speechbrain.pretrained import EncoderDecoderASR
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir=tmpdir)
response = asr_model.transcribe_file('sources/ted.wav')
Am I using it in the right way?
The problem is that this model uses attention, which scans over the entire input when producing each output.
We’re looking into adding online ASR (see this other Discourse thread: Does SpeechBrain allow online decoding? - #2 by titouan.parcollet), which would enable you to process arbitrarily long audio. For now, though, all the pre-trained models are limited to short audio (< 30 seconds).
One hacky way for you to process long audio would be to chop it up into smaller chunks and run the model on each chunk.
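A minimal sketch of that chunking idea, in plain Python and not tied to SpeechBrain. The names chunk_bounds, transcribe_long and transcribe_chunk are hypothetical, and the 20-second default is only an assumption chosen to stay under the 30-second limit mentioned above; transcribe_chunk stands in for whatever runs the model on one chunk.

```python
from typing import Callable, List, Sequence, Tuple


def chunk_bounds(num_samples: int, sample_rate: int,
                 chunk_seconds: float = 20.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive, non-overlapping chunks."""
    chunk_len = int(chunk_seconds * sample_rate)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds * sample_rate must cover at least 1 sample")
    # The last chunk is simply shorter when the audio does not divide evenly.
    return [(start, min(start + chunk_len, num_samples))
            for start in range(0, num_samples, chunk_len)]


def transcribe_long(samples: Sequence[float], sample_rate: int,
                    transcribe_chunk: Callable[[Sequence[float]], str],
                    chunk_seconds: float = 20.0) -> str:
    """Run transcribe_chunk (hypothetical callable) on each chunk and join the results."""
    parts = []
    for start, end in chunk_bounds(len(samples), sample_rate, chunk_seconds):
        text = transcribe_chunk(samples[start:end]).strip()
        if text:  # skip chunks that produced no text
            parts.append(text)
    return " ".join(parts)
```

With SpeechBrain, transcribe_chunk could for instance write the chunk to a temporary WAV file and call transcribe_file on it, as in the snippet from the question. Fixed-length cuts can land in the middle of a word, so words at chunk borders may come out wrong.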
I need this to recognize sentences as well.
That would be impossible.
Google Speech-to-Text also uses attention.
Note that a VAD (voice activity detection) module is under development. When applied to a long input signal, it will automatically split it into smaller chunks that can be processed by the ASR system.
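To illustrate the general idea of splitting on silence, here is a rough energy-based sketch. It is not the SpeechBrain VAD, which is not described in this thread; the function name split_on_silence and all thresholds and defaults are assumptions made up for the example.

```python
from typing import List, Sequence, Tuple


def split_on_silence(samples: Sequence[float],
                     frame_len: int = 320,          # assumed: 20 ms at 16 kHz
                     energy_threshold: float = 1e-4,  # assumed, needs tuning
                     min_silence_frames: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of segments separated by long silences."""
    segments = []
    seg_start = None      # start sample of the segment currently being built
    last_voiced_end = 0   # end sample of the most recent voiced frame
    silent_run = 0        # consecutive silent frames seen inside a segment
    for frame_start in range(0, len(samples), frame_len):
        frame = samples[frame_start:frame_start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)  # mean squared amplitude
        if energy >= energy_threshold:
            if seg_start is None:
                seg_start = frame_start
            last_voiced_end = frame_start + len(frame)
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Silence lasted long enough: close the segment at the last voiced frame.
                segments.append((seg_start, last_voiced_end))
                seg_start = None
                silent_run = 0
    if seg_start is not None:
        segments.append((seg_start, last_voiced_end))
    return segments
```

Each returned segment could then be passed to the ASR model separately. A real VAD would be more robust to noise and music than a fixed energy threshold, and a segment found this way can still exceed the length the model handles.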