When I run training with a batch_size greater than 1, I get this error: ZeroDivisionError: division by zero
When I set the batch size to 1, it runs fine.
The command I use is:
python -m torch.distributed.launch --nproc_per_node=4 /home/ubuntu/DeepSpeech_latest/EX-HD/speechbrain/ASR/my-asr-yaml.yaml --distributed_launch --distributed_backend='nccl' --batch_size=1
I am running the training on an AWS g4dn.12xlarge instance with 4 GPUs.

Hi, we need more context to understand what is happening, such as what your data is, what your YAML looks like, etc.

The data is my own (Arabic data), and the YAML files are the ones here.

My experiment is here.

According to the logs, it is most likely that you have a problem with your audio files. Maybe some of them are empty, returning a length of 0, thus causing this issue.
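
One way to test this hypothesis is to scan the dataset for files with no audio frames before training. The following is only a minimal sketch, assuming the data is stored as PCM WAV files; the function name find_empty_wavs and the directory path in the usage note are hypothetical, and it uses only the Python standard library rather than any SpeechBrain API.

```python
import wave
from pathlib import Path


def find_empty_wavs(root):
    """Return paths of WAV files under `root` that look empty or unreadable.

    Hypothetical helper, not part of SpeechBrain. Assumes PCM WAV input;
    the standard `wave` module may reject other encodings, so a flagged
    file is a candidate to inspect, not a confirmed bad file.
    """
    suspects = []
    for path in sorted(Path(root).rglob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wav:
                # A valid header with zero frames means zero-length audio.
                if wav.getnframes() == 0:
                    suspects.append(path)
        except (wave.Error, EOFError):
            # Zero-byte files, truncated headers, or unsupported encodings.
            suspects.append(path)
    return suspects
```

Calling find_empty_wavs("path/to/your/audio") (a placeholder path) returns a list of files worth checking by hand. Removing those entries from the training manifest, or re-exporting the audio, is one way to see whether the error goes away.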

But why does it run when I use batch_size=1? It should fail too if the audio is corrupted.

No, because with a batch_size of one, the padding function isn’t called. Maybe we should add a check for that, though.

I solved this problem by going back to the original train and YAML files here and making some modifications.