HuggingFace Datasets

HuggingFace provides a Python package as well as a repository for many machine learning datasets. However, there are some common issues with using it on Alvis.

  • By default, the home directory is used to store the processed data.
  • File locking is used, and it does not work everywhere on Alvis.

To work around these issues:
  1. Point HF_HOME to your project storage so that your home directory does not fill up.
  2. The first time you call load_dataset, when the dataset is downloaded, run it on alvis2, the dedicated data transfer node, where file locking also happens to work.
  3. When using the downloaded dataset, use the patched HF-Datasets from the module system and set the environment variable HF_USE_SOFTFILELOCK=true.
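
The steps above could be sketched in Python roughly as follows. This is a minimal sketch and not an official recipe: the helper names configure_hf_cache and load, the "huggingface" subdirectory, and the project path in the usage note are hypothetical placeholders. It assumes the environment variables are set before datasets is imported, since the library reads its cache location at import time.

```python
import os
from pathlib import Path


def configure_hf_cache(project_dir, use_soft_filelock=False):
    """Set the HuggingFace environment variables for the current process.

    Call this before importing `datasets`; if the library has already been
    imported, changing HF_HOME here is not expected to move the cache.
    """
    # The subdirectory name is an arbitrary choice for this sketch.
    hf_home = Path(project_dir) / "huggingface"
    hf_home.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(hf_home)
    if use_soft_filelock:
        # Per the steps above, this is meant for the patched HF-Datasets
        # from the module system.
        os.environ["HF_USE_SOFTFILELOCK"] = "true"
    return hf_home


def load(dataset_name, project_dir, use_soft_filelock=False):
    """Configure the cache, then load (and on first use download) a dataset."""
    configure_hf_cache(project_dir, use_soft_filelock)
    # Imported late so that HF_HOME is already set when the library starts.
    from datasets import load_dataset
    return load_dataset(dataset_name)
```

For the first download on alvis2 one might call load("imdb", "/path/to/your/project/storage"), where both the dataset name and the path are only examples. Later, with the patched module loaded, the same call with use_soft_filelock=True would reuse the already downloaded data.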

For a practical example of this in a particularly tricky case, see our documentation on the ImageNet dataset.