Loading data from a single file (for example HDF5)

Hi!
Thank you for all the work you’ve put into SpeechBrain!

In my case I want to avoid having to read individual wav files from disk during training, so I put all the data into one large HDF5 array on disk, and in the CSV I only store each utterance’s offset and length. However, in the function definition for the dynamic item I need to refer to the opened file handle, and I can’t figure out how to do that.

It doesn’t seem possible to refer to the parent object (the DataPipeline) in the function definition (so I can’t just attach the file handle to it and use it inside the function), and it doesn’t seem possible to define a constant argument in takes(...) (one that is the same for all items).
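For concreteness, my data prep looks roughly like this (a simplified sketch; the dataset name "audio", the file paths, and the manifest are just placeholders):

import csv
import h5py
import torchaudio

wav_paths = {"utt1": "utt1.wav", "utt2": "utt2.wav"}  # placeholder manifest

with h5py.File("audio.hdf5", "w") as f, open("train.csv", "w", newline="") as c:
    writer = csv.writer(c)
    writer.writerow(["ID", "offset", "length"])
    # One big resizable array holding all waveforms back to back.
    data = f.create_dataset("audio", shape=(0,), maxshape=(None,), dtype="float32")
    offset = 0
    for utt_id, path in wav_paths.items():
        signal, _sr = torchaudio.load(path)
        samples = signal.squeeze(0).numpy()
        data.resize((offset + len(samples),))
        data[offset:] = samples
        writer.writerow([utt_id, offset, len(samples)])
        offset += len(samples)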

So yeah, would much appreciate help on how to do this.

PS: A side question: in my limited experience with systems that train on entire utterances, it seems like one has to use a very small batch size to get good performance. Does this match your experience?

Hi! I think @Gastron will be happy to share about his newly merged webdataset feature :stuck_out_tongue:

Our go-to solution for loading data off larger archives is webdataset. In SpeechBrain we recently added some functionality that makes it easier to use webdataset with speech: basically on-the-fly bucketing and dynamic batching. I am working on a tutorial describing that whole data-loading approach, but until it is done you can have a look at the TIMIT example that we used for debugging (only available in Git history, for example here). Note that this does not use DynamicItemDataset (which is a map-style dataset, whereas webdataset provides an IterableDataset).
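Just to give a flavour of webdataset itself (this leaves out the SpeechBrain bucketing and dynamic batching helpers; the shard pattern and the keys stored in each sample are placeholders):

import io
import torch
import webdataset as wds
from torch.utils.data import DataLoader

# Without .decode(), each sample is a dict of raw bytes keyed by file extension.
def decode_sample(sample):
    signal = torch.load(io.BytesIO(sample["pth"]))   # "<key>.pth": saved waveform tensor
    text = sample["txt"].decode("utf-8")             # "<key>.txt": transcript
    return signal, text

dataset = wds.WebDataset("shards/shard-{000000..000009}.tar").map(decode_sample)
# batch_size=None yields individual samples; batching and padding happen elsewhere.
loader = DataLoader(dataset, batch_size=None, num_workers=2)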

If you want to work with HDF5 files and DynamicItemDataset, I have a couple of suggestions.

  1. Use a class that keeps the HDF5 file open, like:
import h5py

class HDF5Data:
    def __init__(self, path):
        self.path = path
        self.file = None

    def load_from_offset(self, offset):
        # Open lazily, so each DataLoader worker opens its own handle.
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        return self.file[offset]   # or however you get the data from the offset.

audiodata = HDF5Data("/scratch/data/audio.hdf5")
# Then pass this as:
dataset.add_dynamic_item(audiodata.load_from_offset, takes=["offset"], provides=["signal"])
  2. I would also suggest distributing the data across multiple files (shards) so that you can load off all of them in parallel. Then you would add a CSV entry for the shard and do something like:
class HDF5Data:
    def __init__(self, top_level_dir):
        self.top_level_dir = top_level_dir
        self.file = None
        self.current_path = None

    def load_from_offset(self, offset, shard):
        shardpath = self.top_level_dir + "/" + shard
        # Reopen the file only when the shard changes.
        if self.file is None or self.current_path != shardpath:
            self.file = h5py.File(shardpath, "r")
            self.current_path = shardpath
        return self.file[offset]   # or however you get the data from the offset.

audiodata = HDF5Data("/scratch/data/audio-shards")  # directory that contains the shard files
dataset.add_dynamic_item(audiodata.load_from_offset, takes=["offset", "shard"], provides=["signal"])

There you should of course also take care to sort the data so that it is ordered first by shard and then by offset (see the sketch below for how this plugs into the rest of the pipeline).
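Putting it together, a rough sketch of how the sharded version plugs into DynamicItemDataset (the CSV path, column names, and batch size are placeholders, not a full recipe):

from speechbrain.dataio.dataset import DynamicItemDataset
from speechbrain.dataio.dataloader import SaveableDataLoader
from speechbrain.dataio.batch import PaddedBatch

# The CSV is expected to have at least ID, offset, and shard columns.
dataset = DynamicItemDataset.from_csv("train.csv")
audiodata = HDF5Data("/scratch/data/audio-shards")
dataset.add_dynamic_item(
    audiodata.load_from_offset, takes=["offset", "shard"], provides=["signal"]
)
dataset.set_output_keys(["id", "signal"])

dataloader = SaveableDataLoader(dataset, batch_size=8, collate_fn=PaddedBatch)
for batch in dataloader:
    signals, lengths = batch.signal  # PaddedBatch pads and returns relative lengths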

Thanks for the reply.

I have tried the webdataset implementation based on the TIMIT example, and unfortunately it was not faster. It seems webdataset still loads items one by one from the tar. In the end, what worked best for me (and what I would recommend to others) is to implement an IterableDataset where a large part of the data is loaded into memory and used to create batches, while the next part is loaded in the background. This was nearly an order of magnitude faster than the other approaches.
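Roughly, the idea looks like this (a simplified sketch, not my actual implementation; the dataset name "audio" and the chunk size are placeholders):

import queue
import threading

import h5py
import torch
from torch.utils.data import IterableDataset

class ChunkedHDF5Dataset(IterableDataset):
    def __init__(self, path, chunk_samples=250_000_000):  # roughly 1 GB of float32
        self.path = path
        self.chunk_samples = chunk_samples

    def _chunk_loader(self, out_queue):
        # Runs in a background thread: reads the next big slice while the
        # previous one is being consumed.
        with h5py.File(self.path, "r") as f:
            data = f["audio"]
            for start in range(0, len(data), self.chunk_samples):
                out_queue.put(data[start:start + self.chunk_samples])
        out_queue.put(None)  # sentinel: no more chunks

    def __iter__(self):
        chunks = queue.Queue(maxsize=1)  # prefetch at most one chunk ahead
        threading.Thread(target=self._chunk_loader, args=(chunks,), daemon=True).start()
        while (chunk := chunks.get()) is not None:
            # Here you would slice individual utterances out of the in-memory
            # chunk using the offsets/lengths from the CSV and form batches;
            # this sketch just yields the raw chunk.
            yield torch.from_numpy(chunk)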

And the PyTorch DataLoader already implements the kind of pre-fetch queue that @fasttosmile described. It can be used with any Dataset, but it only loads data in background worker processes when num_workers >= 1.
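For example (here dataset is any Dataset and the numbers are arbitrary):

from torch.utils.data import DataLoader

# With num_workers >= 1, batches are loaded in background worker processes;
# prefetch_factor controls how many batches each worker keeps ready ahead of time.
loader = DataLoader(dataset, batch_size=8, num_workers=2, prefetch_factor=2)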

However, the implementation details and hardware still dictate which approach works best.

To clarify, I’m not talking about loading individual samples in the background like the PyTorch DataLoader does, but about loading very large chunks of samples at once (~5 GB), which are then used to create the batches for training.

It seems like webdataset might have that capability as well: (question) Confused by slowdown after using ShardWriter instead of TarWriter · Issue #86 · webdataset/webdataset · GitHub. I still need to test that, though.


The discussion you linked from the webdataset GitHub seems to have good ideas to try, thanks! Still, I think the hardware and implementation details will determine which approach works best, and I fear there is no single approach that would always work really well. It would still be great for us to have a good default approach to recommend, one that works well in most cases.
