PyTorch¶
PyTorch is a popular machine learning (ML) framework.
A common use case is to import PyTorch as a module in Python. It is then up to you as a user to write your particular ML application as a Python script using the functionality of the torch Python module.
We provide precompiled, optimised installations of both legacy and recent versions of PyTorch in our tree of software modules; see our introduction to software modules. Just like with most software, search for all available versions with module spider pytorch. If you want to run on CUDA-accelerated GPU hardware, make sure to select a version with CUDA.
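For example, to search for and load a specific version (the exact version strings on offer will differ over time):

```
[cid@vera1 ~]$ module spider PyTorch-bundle
[cid@vera1 ~]$ module load PyTorch-bundle/1.13.1-foss-2022a-CUDA-11.7.0
```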
It is also possible to run PyTorch using containers, of which many versions are already centrally installed.
PyTorch is heavily optimised for GPU hardware, so we recommend using the CUDA version and running it on the compute nodes equipped with GPUs. How to do this is described in our guide to running jobs.
Quick guide¶
- Use module spider PyTorch-bundle to find the latest modules of PyTorch, torchvision etc.
- Do you need a newer PyTorch version?
- Use GPUs primarily.
- Apply to Alvis to get access to GPUs.
- See the PyTorch documentation and/or the Alvis intro tutorial for using a GPU.
- Check that you are using a GPU and monitor your GPU usage.
- Use profiling to make good use of GPUs.
- For multi-GPU usage, check out the specifics for Alvis.
- Use the right precision for your use case.
- If GPU utilisation goes down between batches, look at your dataloading pipeline.
- If you are running long jobs, use checkpointing (a minimal sketch follows this list).
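A minimal checkpointing sketch; the file name and the saved fields are illustrative, adapt them to your own training loop:

```python
import torch
import torch.nn as nn

# Toy model and optimizer standing in for your real training objects.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
epoch = 5  # whatever epoch your training loop has reached

# Save everything needed to resume a long job after a restart or timeout.
torch.save(
    {
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    },
    "checkpoint.pt",
)

# On resume:
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])
start_epoch = checkpoint["epoch"] + 1
```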
Checking for available GPUs¶
After loading the PyTorch module of your choice, your environment is now configured to start using PyTorch in Python. Here is a small test that prints the PyTorch version available in your environment:
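```
[cid@vera1 ~]$ python -c "import torch; print(torch.__version__)"
```

The printed version should match the module you loaded.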
If you intend to run your calculations on GPU hardware, it can be useful to check that PyTorch detects the GPU hardware using the torch.cuda submodule. Here is an example from a node equipped with an Nvidia Quadro GPU:
[cid@vera1 ~]$ python -c "import torch; print('CUDA enabled:', torch.cuda.is_available())"
CUDA enabled: True
[cid@vera1 ~]$ python -c "import torch.cuda as tc; id = tc.current_device(); print('Device:', tc.get_device_name(id))"
Device: Quadro P2000
To use GPUs, check out the official PyTorch documentation and/or the Alvis intro tutorial.
PyTorch-bundle¶
PyTorch-bundle is a module which bundles PyTorch, PyTorch-Ignite, torchvision, torch_tb_profiler, torchtext and torchdata. Other less used PyTorch projects like torchaudio can be added on request.
Use e.g. module spider PyTorch-bundle/1.13.1-foss-2022a-CUDA-11.7.0 to see what is included in a particular version of PyTorch-bundle.
Performance and precision¶
Which GPU you're using and which data type is used in computations can have a huge impact on performance at maximum utilisation (see GPU hardware details). The main performance gain from using Ampere GPUs and newer (A40s and A100s in our case) comes from using the tensor cores. We recommend that all PyTorch users check out the following links:
- What Every User Should Know About Mixed Precision Training in PyTorch
- torch.amp
- torch.set_float32_matmul_precision
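As a quick illustration, here is a minimal mixed-precision training step using torch.autocast and a gradient scaler; the model, optimizer and data are toy stand-ins:

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

# Allow TF32 on Ampere tensor cores for the remaining float32 matmuls.
torch.set_float32_matmul_precision("high")

inputs = torch.randn(32, 128, device=device)
targets = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
# Run the forward pass in float16 where safe; tensor cores handle the matmuls.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
# Scale the loss to avoid float16 gradient underflow, then step and update.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```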
Dataloading¶
Machine learning datasets are commonly made up of a multitude of files. In HPC environments this can be less than ideal: reading (and writing) many small files is generally a lot slower than working with a few large files. You can find our general tips at datasets.
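If GPU utilisation drops between batches, the dataloading pipeline is often the bottleneck. Here is a minimal sketch of a DataLoader configured to overlap loading with GPU compute; the dataset is a toy in-memory stand-in:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy in-memory dataset; on real data, prefer a few large files over many small ones.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,          # match to the CPU cores allocated to your job
    pin_memory=True,        # pinned host memory speeds up host-to-device copies
    persistent_workers=True,
)

for inputs, targets in loader:
    # non_blocking copies can overlap with compute when pin_memory is enabled
    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    # ... forward/backward pass ...
```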
Using multiple GPUs¶
This section assumes that you are using Alvis. If you are a Vera user who relies heavily on GPUs for machine learning with PyTorch, you should apply for resources on Alvis, where more GPU resources are available.
The two typical ways to scale to multiple GPUs are:
- Data Parallelism
- Model Parallelism
Most multi-GPU jobs will benefit from GPUDirect and InfiniBand across nodes. For multi-node jobs, check at least that:
- you are on a node with InfiniBand;
- the data transfer between nodes is making use of InfiniBand, e.g. by running job_stats.py <JOBID> and checking the network graphs.
Data parallelism¶
With data parallelism, the model is broadcast to all GPUs, separate batches on the different GPUs compute the weight updates in parallel, and the results are then combined into a single update as if you had used one large batch. This can be useful for speed-up, or if you want larger effective batches than fit in a single GPU's memory.
Data parallelism in PyTorch is best done with the DistributedDataParallel wrapper and the torchrun command. You can find examples of data parallelism in our Alvis tutorial.
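A minimal DistributedDataParallel sketch; the script name and model are illustrative. Launch it with e.g. torchrun --nproc_per_node=4 minimal_ddp.py:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, WORLD_SIZE and LOCAL_RANK in the environment.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(nn.Linear(128, 10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step; DDP averages gradients across all GPUs during backward().
inputs = torch.randn(32, 128).cuda()
targets = torch.randint(0, 10, (32,)).cuda()
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()

dist.destroy_process_group()
```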
Model parallelism¶
Model parallelism is about storing parts of the model on different GPUs. This is used if your model is too large to fit on a single GPU; for the GPUs available on Alvis this should rarely be a problem, but in some cases you might reach this limit. Remember that you can see your resource usage for a job with the command job_stats.py.
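A minimal sketch of what this can look like, splitting a toy model across two GPUs (assumes at least two GPUs are allocated to the job):

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy model with its two layers placed on different GPUs."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(128, 256).to("cuda:0")
        self.part2 = nn.Linear(256, 10).to("cuda:1")

    def forward(self, x):
        # Move activations between devices as they flow through the model.
        x = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
output = model(torch.randn(32, 128))  # the output lives on cuda:1
```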
FAQ¶
Can you install a newer PyTorch version?¶
We're working with the EasyBuild community in preparing and patching new PyTorch versions. You can send a support question if you want to be kept up to date on when we add a new version.
An alternative is to use the containers at /apps/containers/PyTorch. We happily build later versions of these on request. Note, however, that these have not undergone the same testing and patching procedure as the module-provided ones.
We typically recommend NGC containers over the plain PyTorch ones. For the meaning of the names see What does NGC mean in the containers?.
We do not recommend installing your own version unless necessary. If you install your own version, it is up to you to make sure that the installation has the capabilities that you need (see the sketch after this list):
- Is the software built for the CUDA compute capabilities corresponding to the available GPU types?
- If you're making use of CPU computations, is the software built for AVX512?
- If you're doing multi-GPU jobs, is the software built for GPUDirect and InfiniBand?
- ...
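For the first point, a quick way to compare what a PyTorch build was compiled for against the GPU on the current node:

```python
import torch

# Architectures this PyTorch build was compiled for, e.g. ['sm_70', 'sm_80', ...]
print("Built for:", torch.cuda.get_arch_list())
# Compute capability of the GPU on this node, e.g. (8, 0) for an A100
print("This GPU:", torch.cuda.get_device_capability(0))
```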
What does NGC mean in the containers?¶
Containers with NGC in the name are from Nvidia's NGC catalog. Other containers at /apps/containers/PyTorch/ are official PyTorch containers from Docker Hub. The provided containers are not as patched and verified as the provided modules, but should work well for most cases. They even have support for communication over InfiniBand when using NCCL for multi-node communication.