Expected speed on CPU and GPU?

Nice job on SpeechBrain!

I’m trying out the pretrained model asr-crdnn-rnnlm-librispeech and was wondering if the speed I’m getting is typical. I’m running on a 25-second audio file with fairly clean audio (although the person is speaking quickly), and it’s taking about 5.4 seconds on a reasonably recent GPU and about 107 seconds on a 64-CPU system. I know the AM is pretty big, but this seems slower than I would expect.

That’s interesting. Do you have a comparison with other toolkits on this utterance? Note that this model uses attention decoding → the longer the wav is, the longer the attention context will be … A solution here would be to integrate https://www-i6.informatik.rwth-aachen.de/publications/download/1118/MerboldtAndreZeyerAlbertSchlüterRalfNeyHermann--Ananalysisoflocalmonotonicattentionvariants--2019.pdf into SpeechBrain. This would be super simple, but we currently don’t have the resources to do it (Interspeech …). The idea is also simple: do not compute attention over the whole encoder output; rather, start at the t-1 argmax() and run it over a small window.
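To make the idea concrete, here is a minimal sketch of that kind of windowed local attention, assuming a single query vector per decoding step; the function name and shapes are hypothetical, not the SpeechBrain API:

```python
import torch

def windowed_attention(query, keys, values, prev_peak, window=8):
    """Toy local attention: score only a small window of encoder
    frames centred on the previous step's argmax (prev_peak),
    instead of attending over the full sequence.

    query:  (d,)    decoder state for the current step
    keys:   (T, d)  encoder outputs
    values: (T, d)  encoder outputs (same tensor in the simplest case)
    """
    T = keys.size(0)
    lo = max(0, prev_peak - window)
    hi = min(T, prev_peak + window + 1)
    scores = keys[lo:hi] @ query             # (hi - lo,)
    weights = torch.softmax(scores, dim=0)
    context = weights @ values[lo:hi]        # weighted sum over the window
    new_peak = lo + int(weights.argmax())    # anchor for the next step
    return context, new_peak
```

The cost per decoding step is then O(window) instead of O(T), which is exactly why long wavs stop hurting.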

Maybe we should try to build a team interested in “optimising” SpeechBrain models for inference.


I don’t have anything like apples-to-apples, but a near state-of-the-art Kaldi TDNN takes about 2.5 seconds on 1 CPU.

Well, Kaldi isn’t E2E ASR and is C++-based. Plus, it doesn’t integrate LM decoding, right? You just do some rescoring afterwards? But yes, they have online ASR, which allows for fast decoding. You won’t find anything “production ready” for E2E ASR right now on the market … Hopefully we will implement it soon in SpeechBrain if enough people are willing to work on it :smiley:


As I said, I have nothing apples-to-apples (that is, nothing at all comparable). I was just wondering whether this is the expected speed or whether something is wrong with my setup.

Just to add another data point here: for QuartzNet (NVIDIA’s fully convolutional ASR using time/depth-separable convolutions), inference time is just under 1 s for 5 s of audio on a single CPU (no LM; greedy decoding or beam search without an LM).

QuartzNet and Jasper are CTC-only, so decoding is much faster than with attention-based models (without windowing). We would like to add windowing support to our attention models to reduce decoding time. It would be great if someone could train a CTC-only model that could be compared to QuartzNet.
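For reference, CTC greedy decoding is just a single pass over the frames with no decoder state or attention context at all, which is where the speed gap comes from. A minimal sketch (token IDs and the blank index are assumptions for illustration):

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    """Greedy CTC decoding: take the argmax at each frame,
    collapse consecutive repeats, then drop blanks.

    log_probs: (T, C) per-frame log-probabilities over C tokens.
    Returns the decoded list of token IDs.
    """
    best = log_probs.argmax(dim=-1).tolist()  # frame-wise argmax, length T
    out, prev = [], None
    for token in best:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return out
```

This runs in O(T·C) with no per-step search, whereas an attention decoder must recompute attention over the encoder output at every emitted token.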

So far, we haven’t considered low-latency models. There are many of them that we might want to import into SpeechBrain, like QuartzNet or Citrinet. It would be really great if someone could work on that and open a PR.

I also faced a similar issue with a pretrained ASR model.
I tried the “asr-crdnn-rnnlm-librispeech” model.
It is not able to run inference on audio files longer than 1–2 minutes, which is quite short. And it’s very slow.
Is there a limit on the length of the audio file?
I tried DeepSpeech’s pretrained ASR model, which runs inference in 1191.673 s on a 1472.896 s audio file.

I tried it on the free version of Google Colab, without a GPU.

As for long inputs, a VAD is under development. Once ready, it will automatically split long recordings into smaller chunks that the speech recognizer can transcribe more easily.
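While waiting for that, the chunking idea can be approximated by hand. Below is a sketch using a crude energy threshold as a stand-in for a real VAD (the function, thresholds, and parameters are all illustrative assumptions, not SpeechBrain code):

```python
def split_on_silence(samples, sr, frame_ms=30, thresh=0.01, max_chunk_s=20.0):
    """Cut a waveform at low-energy frames so that each chunk stays
    reasonably short, then each chunk can be transcribed independently
    and the hypotheses concatenated.

    samples: list/array of float samples, sr: sample rate in Hz.
    Returns a list of (start, end) sample indices.
    """
    frame = int(sr * frame_ms / 1000)
    max_len = int(sr * max_chunk_s)
    chunks, start = [], 0
    for pos in range(frame, len(samples) - frame, frame):
        window = samples[pos:pos + frame]
        energy = sum(x * x for x in window) / frame
        # cut at a quiet frame once the current chunk has some length
        if energy < thresh and pos - start >= max_len // 2:
            chunks.append((start, pos))
            start = pos
    chunks.append((start, len(samples)))
    return chunks
```

A real VAD makes much better cut decisions, but even this keeps the attention context bounded per chunk, which is what matters for decoding time.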

A user ran some preliminary decoding-speed tests: here