I am seeing something unexpected when I run the asr_brain.evaluate() function. With 8 GPUs it takes approximately the same time as with 1 GPU. I can see jobs running on all 8 GPUs when I force it to use them, but it seems that all of them are running the same job.
The way I tested this is with the train.py script (recipes/LibriSpeech/ASR/transformer/train.py): after training the model for several epochs, I changed the parameter number_of_epochs in conformer_small.yaml to 1 so the script skips training and moves directly to the asr_brain.evaluate() function. The commands I used to compare the performance are:
- 8 GPUs
python -m torch.distributed.launch --nproc_per_node=8 train.py hparams/conformer_small.yaml --distributed_launch --distributed_backend=nccl
- Single GPU
python train.py hparams/conformer_small.yaml
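For context, my understanding is that with DDP each rank should evaluate only its own shard of the test set. Here is a pure-Python sketch of the round-robin sharding that torch.utils.data.DistributedSampler performs (the function name is mine, just for illustration); if no such sharding happens during evaluation, every rank would process the full dataset, which would match what I observe:

```python
import math

def shard_indices(dataset_len, num_replicas, rank):
    # Mimics torch.utils.data.DistributedSampler with shuffle=False,
    # drop_last=False: pad the index list so it divides evenly, then
    # give each rank every num_replicas-th index starting at its rank.
    total = math.ceil(dataset_len / num_replicas) * num_replicas
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]  # pad with wrapped indices
    return indices[rank::num_replicas]

# With sharding, 4 ranks each see ~1/4 of a 10-item set:
print(shard_indices(10, 4, 0))  # rank 0 -> [0, 4, 8]
print(shard_indices(10, 4, 1))  # rank 1 -> [1, 5, 9]
```

If evaluation instead iterates the plain (unsharded) index list on every rank, all 8 GPUs do identical work and wall-clock time does not improve.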
Is it expected that inference cannot be parallelized? Am I missing something in the configuration?
Thanks for your work and help!