Using DDP with an Iterable dataset


I am currently trying to use SpeechBrain to run training with DistributedDataParallel (DDP). My dataset object is a WebDataset (GitHub - webdataset/webdataset: A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.) and I'm using `speechbrain.dataio.iterators.dynamic_bucketed_batch` to batch the dataset. During training, execution stalls indefinitely at the start of the next epoch and never proceeds. I also see the following warning when training starts, and I think the issue could be linked to it:

Cannot automatically solve distributed sampling for IterableDataset

Isn't DDP supported for IterableDataset in SpeechBrain yet? Can anyone please provide more information about this? Is DataParallel supported with IterableDataset?

Hey @anandcu3 ,

The stalling sounds like an issue with inter-process communication, or perhaps not all processes are reaching all DDP barriers. Debugging this can be difficult, but one thing you could do is have every process write status messages into a file named after its rank (0.txt, 1.txt, etc.) to see which state each process reaches.
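A minimal sketch of that per-rank logging idea, assuming a standard DDP launcher (e.g. torchrun) that exports the `RANK` environment variable for each process; the `debug_log` helper name and message strings are made up for illustration:

```python
import os

def debug_log(message):
    # DDP launchers such as torchrun export RANK per process;
    # default to "0" so this also works when run single-process.
    rank = os.environ.get("RANK", "0")
    # Append, so messages from successive checkpoints accumulate in order.
    with open(f"{rank}.txt", "a") as f:
        f.write(message + "\n")

# Sprinkle calls like these around the suspected barriers / epoch boundary:
debug_log("before epoch barrier")
# ... barrier or start-of-epoch code here ...
debug_log("after epoch barrier")
```

After a stall, comparing the tails of the per-rank files shows which process stopped at which point.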

That warning is probably not relevant and can be ignored, because WebDataset solves the distributed sampling itself. (It has to be solved on the dataset side, since the Sampler for an IterableDataset is a trivial infinite sampler.)