Inference time single GPU vs multiple GPU


I am seeing something unexpected when I run the asr_brain.evaluate() function. Evaluation takes approximately the same time with 8 GPUs as with 1 GPU. When I force it to use 8 GPUs, I can see jobs running on all 8 of them, but it seems that all of them are running the same job.

I tested this with the script (recipes/LibriSpeech/ASR/transformer/) after training the model for several epochs, changing only the parameter number_of_epochs in conformer_small.yaml to 1 so that the script skips training and moves directly to the asr_brain.evaluate() function. The commands I used to compare the performance are:

  • 8 GPU
python -m torch.distributed.launch --nproc_per_node=1 hparams/conformer_small.yaml --distributed_launch --distributed_backend=nccl
  • Single GPU
python hparams/conformer_small.yaml

Is it expected that inference cannot be parallelized across GPUs? Am I missing something in the configuration?

Thanks for your work and help!

Hey, unfortunately, this is expected with DDP. All processes run the same evaluation.
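
To make the difference concrete, here is a minimal plain-Python sketch of replicated vs. sharded evaluation. It is only an illustration under stated assumptions, not SpeechBrain's API: `shard_for_rank` is a hypothetical helper, and the test-set size of 10 is made up. The round-robin split is roughly the idea behind PyTorch's `torch.utils.data.distributed.DistributedSampler`, which additionally pads so every rank gets the same number of items.

```python
def shard_for_rank(num_items, rank, world_size):
    """Hypothetical helper: indices one rank would evaluate under round-robin sharding."""
    return list(range(rank, num_items, world_size))


num_items = 10   # made-up test-set size, for illustration only
world_size = 8   # number of processes, one per GPU

# Replicated evaluation (the behavior described above): every rank sees every item.
replicated = [list(range(num_items)) for _ in range(world_size)]

# Sharded evaluation: each rank sees a disjoint slice of the test set.
sharded = [shard_for_rank(num_items, rank, world_size) for rank in range(world_size)]

# Wall-clock time is bounded by the busiest rank, so compare the largest workloads.
replicated_work = max(len(items) for items in replicated)
sharded_work = max(len(items) for items in sharded)

print("items per rank, replicated:", replicated_work)
print("items per rank, sharded:", sharded_work)
```

With replication, each of the 8 processes does the full workload, so adding GPUs does not shorten evaluation. With sharding, the per-rank results (for example, the error counts behind a WER) would still need to be gathered across ranks before reporting a final metric.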