ValueError: Loss is not finite and patience is exhausted


I am getting a loss value of `nan` during the 2nd epoch of AM training. Error screenshot attached. Below are a few additional details about the recipe and GPU:

  1. Recipes adapted for my own data from templates/speech_recognition/ASR.

  2. Skipped only the data preparation step.

  3. train.yaml: Using pre-trained Tokenizer & LM using domain data. Not using pre-trained AM.

  4. Using single GPU: NVIDIA Tesla K80 - 12 GB

  5. Command used to run: `python train.py train.yaml`

I found that this error was previously reported for multi-GPU use in GitHub issue #704. Should I use multiple GPUs instead?

Follow up:

  1. Every time I increase batch_size > 1, the code fails with `RuntimeError: generator raised StopIteration` at `for batch in t`. Is the GPU memory too small to handle this, or am I doing something wrong?

Hey, so there are two issues here:

a. You get a NaN loss. This is almost always due to the data, unless you are using LiGRU, which is prone to this error. Is your AM based on LiGRU? If not, it is most likely that one of your samples has something weird in it.

b. The best way to check whether batches fit in VRAM is the following: set `sorting: descending` in the loader, then try to increase the batch size. VRAM consumption depends on sentence length, batch size, and model size. If any of these gets large (for instance, sentences longer than 25-30 s), consumption will spike.
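To see why descending sorting is a good VRAM probe, here is a minimal plain-Python sketch (the function and variable names are illustrative, not the SpeechBrain API): with utterances sorted longest-first, the heaviest batch is the very first one, so an out-of-memory failure appears within the first few steps instead of partway through an epoch.

```python
# Sketch: batching longest-first so the worst-case batch comes first.
# Illustrative only -- not the SpeechBrain dataloader API.

def make_batches(utterances, batch_size):
    """Group (id, duration_s) pairs into fixed-size batches, longest first."""
    ordered = sorted(utterances, key=lambda u: u[1], reverse=True)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

utts = [("a", 1.2), ("b", 31.0), ("c", 0.4), ("d", 8.5)]
batches = make_batches(utts, batch_size=2)

# The batch containing the 31 s utterance is processed first, so a
# VRAM overflow would be triggered immediately rather than mid-epoch.
print(batches[0])  # → [('b', 31.0), ('d', 8.5)]
```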


Thanks for your response.

  1. I am using exactly the AM architecture from ‘templates/speech_recognition/ASR/train.yaml’. I restarted the AM training with multi-GPU/data-parallel (batch_size=1) and it almost reached the end of the 2nd epoch.

    Maybe there is some issue with the sample(s). Most of my utterances are between 500 ms and 2 s, but there are also a few shorter than 500 ms and longer than 30 s. Maybe the very short utterances are not providing enough frames/features.

Encoder: CRDNN
Decoder: GRU + beamsearch + RNNLM

  1. So I changed sorting → ‘descending’, but I still get the same error (new screenshot attached). I also tried the data-parallel option with 2-3 GPUs, but it doesn’t help.

Right. I suggest that you remove sentences shorter than 500 ms (maybe 1 s if you can) as well as those longer than 30 s. You can do it easily in the dataloader function; see the following recipe and the avoid_if_longer_than parameter:

You can do the same for short utterances (have a look at both the yaml and the .py for how to implement it; it’s one line).
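On the yaml side, the length filters look roughly like this (avoid_if_longer_than appears in the CommonVoice-style recipes; avoid_if_shorter_than is an assumed symmetric hyperparameter you would add yourself, and the exact names may differ in your template):

```yaml
# Length filters in seconds.
avoid_if_longer_than: 30.0
avoid_if_shorter_than: 0.5
```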


Thanks @titouan.parcollet.

I will first summarize what I did for future readers. I added 2 hyperparameters in the yaml file for the max & min lengths. Then, in the dataio_prepare function, I added key_max_value & key_min_value arguments to the filtered_sorted() call, each pointing to the corresponding hyperparameter. Make sure to edit both branches of the sorting condition logic. I restarted the model training from scratch and did not face any errors related to loss or batch_size. This time I tried a batch_size of 2 on 2 GPUs and it worked fine.
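For future readers, here is a self-contained illustration of what those key_min_value / key_max_value filters do. This is a plain-Python mimic of the behaviour, not the actual SpeechBrain DynamicItemDataset code; the data and thresholds are made up for the example.

```python
# Plain-Python mimic of the filtering described above: keep only items
# whose "duration" lies within [min, max], then sort. In SpeechBrain the
# real work is done by DynamicItemDataset.filtered_sorted().

def filtered_sorted(data, key_min_value, key_max_value, sort_key, reverse=False):
    kept = [
        item for item in data
        if all(item[k] >= v for k, v in key_min_value.items())
        and all(item[k] <= v for k, v in key_max_value.items())
    ]
    return sorted(kept, key=lambda item: item[sort_key], reverse=reverse)

data = [
    {"id": "u1", "duration": 0.3},   # too short, dropped
    {"id": "u2", "duration": 5.0},
    {"id": "u3", "duration": 42.0},  # too long, dropped
    {"id": "u4", "duration": 1.5},
]
train_data = filtered_sorted(
    data,
    key_min_value={"duration": 0.5},
    key_max_value={"duration": 30.0},
    sort_key="duration",
    reverse=True,  # descending, matching "sorting: descending"
)
print([item["id"] for item in train_data])  # → ['u2', 'u4']
```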

Finally, the WER was very high, and I have a question about that. Although it’s very difficult for anyone to comment on the specific data I am using, are there any recommendations for smaller datasets (~20-25 hours)? Based on some reading, it seems E2E models need close to 100 hours to reach decent performance. Using the pre-trained LibriSpeech AM also didn’t work initially.
