HuggingFace Datasets
HuggingFace provides a Python package as well as a repository for many machine learning datasets. However, there are some common issues with using it on Alvis.
- By default, the home directory is used to store the downloaded and processed data.
- File locking is used, which does not work reliably everywhere on Alvis.
Recommended use
- Point HF_HOME to your project storage so that you do not fill up your home directory.
- The first time you call load_dataset and the dataset is downloaded, do this on alvis2, the dedicated data transfer node, where file locking also happens to work.
- When using the downloaded dataset, use the patched HF-Datasets from the module system and set the environment variable HF_USE_SOFTFILELOCK=true.
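The recommendations above could be applied from within Python roughly as follows. This is a minimal sketch, not an official recipe: the storage path and the helper names configure_hf_environment and load_cached_dataset are hypothetical, and HF_USE_SOFTFILELOCK is assumed to be honoured only by the patched HF-Datasets module, not by the upstream package.

```python
import os
from pathlib import Path

# Hypothetical placeholder: replace with a directory in your own project storage.
HF_HOME_DIR = Path("/path/to/your/project/storage")


def configure_hf_environment(hf_home: Path, use_soft_filelock: bool = True) -> dict:
    """Set the environment variables from the recommendations and return them.

    Call this before the datasets package is imported for the first time,
    since the cache location is typically read from HF_HOME at import time.
    """
    settings = {"HF_HOME": str(hf_home)}
    if use_soft_filelock:
        # Assumption: this variable is read by the patched HF-Datasets module
        # from the module system; upstream releases may ignore it.
        settings["HF_USE_SOFTFILELOCK"] = "true"
    os.environ.update(settings)
    return settings


def load_cached_dataset(name: str):
    """Load a dataset after the environment has been configured."""
    # Imported lazily so that the environment variables set above are
    # already in place when the package reads them.
    from datasets import load_dataset

    return load_dataset(name)


# Apply the settings for the current process.
configure_hf_environment(HF_HOME_DIR)
```

After this, calling load_cached_dataset with the name of your dataset uses the cache under HF_HOME. For the initial download on alvis2, where file locking works, use_soft_filelock=False should be sufficient. Exporting the same variables in your job script before starting Python achieves the same effect.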
For a worked example of these recommendations in a particularly tricky case, see our documentation on the ImageNet dataset.