PyTorch

PyTorch is a popular machine learning (ML) framework.

A common use case is to import PyTorch as a module in Python. You then write your particular ML application as a Python script using the functionality of the torch module.

We provide precompiled, optimised installations of both legacy and recent versions of PyTorch in our tree of software modules, see our introduction to software modules. Just like with most software, search for all available versions with module spider pytorch. If you want to run on CUDA-accelerated GPU hardware, make sure to select a version with CUDA. It is also possible to run PyTorch using containers; many versions are already centrally installed.

PyTorch is heavily optimised for GPU hardware, so we recommend using the CUDA version and running it on the compute nodes equipped with GPUs. How to do this is described in our guide to running jobs.

Quick guide

Checking for available GPUs

After loading the PyTorch module of your choice, your environment is now configured to start using PyTorch in Python. Here is a small test that prints the PyTorch version available in your environment:

[cid@vera1 ~]$ python -c "import torch; print(torch.__version__)"
1.11.0

If you intend to run your calculations on GPU hardware, it can be useful to check that PyTorch detects the GPU hardware using the torch.cuda submodule. Here is an example from a node equipped with an Nvidia Quadro GPU.

[cid@vera1 ~]$ python -c "import torch; print('CUDA enabled:', torch.cuda.is_available())"
CUDA enabled: True
[cid@vera1 ~]$ python -c "import torch.cuda as tc; id = tc.current_device(); print('Device:', tc.get_device_name(id))"
Device: Quadro P2000

To use GPUs, check out the official PyTorch documentation and/or the Alvis intro tutorial.
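
A minimal example of the common device-selection idiom, creating a tensor on the GPU when one is available and falling back to the CPU otherwise (on a GPU node this prints cuda:0):

[cid@vera1 ~]$ python -c "import torch; dev = 'cuda' if torch.cuda.is_available() else 'cpu'; print(torch.ones(2, 2, device=dev).device)"
cuda:0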

PyTorch-bundle

PyTorch-bundle is a module which bundles PyTorch, PyTorch-Ignite, torchvision, torch_tb_profiler, torchtext and torchdata. Other less used PyTorch projects like torchaudio can be added on request.

Use e.g. module spider PyTorch-bundle/1.13.1-foss-2022a-CUDA-11.7.0 to see what is included in a particular version of PyTorch-bundle.
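
To use a version, you load it as usual; depending on the module hierarchy, module spider also tells you which prerequisite modules, if any, must be loaded first:

[cid@vera1 ~]$ module load PyTorch-bundle/1.13.1-foss-2022a-CUDA-11.7.0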

Performance and precision

Which GPU you're using and which data type is used in computations can have a huge impact on performance at max utilisation (see GPU hardware details).

The main performance gain in using Ampere GPUs and newer (A40s and A100s in our case) comes from using the tensor cores, which are engaged through lower-precision data types such as TF32 and FP16, for example via automatic mixed precision. We recommend all PyTorch users to read up on this in the official PyTorch documentation.
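
As an illustration, here is a minimal training-loop sketch that engages the tensor cores through automatic mixed precision with torch.cuda.amp (available since PyTorch 1.6); the model, data and hyperparameters are placeholders:

import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(1024, 1024).to(device)      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid FP16 underflow

for step in range(10):
    x = torch.randn(64, 1024, device=device)  # placeholder batch
    target = torch.randn(64, 1024, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # forward pass runs in mixed precision
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()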

Dataloading

Machine learning datasets are commonly made up of a multitude of files. In HPC environments this can be less than ideal: reading (and writing) is generally a lot slower when done against many small files than against a few large ones. You can find our general tips at datasets.
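
As a sketch of the few-large-files approach, assuming your samples have been packed into a single HDF5 file (the file name and the "images"/"labels" dataset names are hypothetical, and h5py must be available in your environment):

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5Dataset(Dataset):
    """Reads samples from one large HDF5 file instead of many small files."""

    def __init__(self, path):
        self.path = path
        self.file = None                      # opened lazily, once per worker
        with h5py.File(path, "r") as f:
            self.length = len(f["labels"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.file is None:                 # each DataLoader worker gets its own handle
            self.file = h5py.File(self.path, "r")
        x = torch.from_numpy(self.file["images"][idx])
        y = int(self.file["labels"][idx])
        return x, y

loader = DataLoader(H5Dataset("dataset.h5"), batch_size=64, num_workers=4)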

Using multiple GPUs

This section assumes that you are using Alvis. If you are a Vera user who is heavily reliant on GPUs for your machine learning with PyTorch, you should apply for resources on Alvis, where more GPU resources are available.

The two typical ways to scale to multiple GPUs are:

  • Data Parallelism
  • Model Parallelism

Most multi-GPU jobs will benefit from GPUDirect and InfiniBand across nodes. For multi-node jobs, check at least that:

  1. You are on a node with InfiniBand.
  2. The data transfer between nodes is making use of InfiniBand, e.g. by running job_stats.py <JOBID> and checking the network graphs.

Data parallelism

With data parallelism, the model is broadcast to all GPUs, separate batches on the different GPUs compute the weight updates in parallel, and the results are then summarised into a single update, as if you had used one large batch. This is useful for speed-up, or if you want larger batches than fit in a single GPU's memory.

Data parallelism in PyTorch is best done with the DistributedDataParallel wrapper and the torchrun command. You can find examples on data parallelism in our Alvis tutorial.
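
A minimal, self-contained sketch of DistributedDataParallel, with a toy model and random data as placeholders:

# train_ddp.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group("nccl")           # torchrun provides the env:// variables
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(32, 2).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)     # gives each rank a distinct shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)              # reshuffle across ranks each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(model(x.cuda()), y.cuda())
            loss.backward()                   # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with e.g. torchrun --nproc_per_node=4 train_ddp.py on a node with four GPUs; torchrun starts one process per GPU and sets the environment variables that init_process_group reads.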

Model parallelism

Model parallelism is about storing parts of the model on different GPUs. This is used if your model is too large to fit on a single GPU; for the GPUs available on Alvis this should rarely be a problem, but in some rare cases you might reach this limit. Remember that you can see the resource usage for a job with the command job_stats.py.
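
A minimal sketch of manual model parallelism, splitting a toy model across two GPUs (requires at least two visible GPUs):

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Keeps one part of the model on each GPU; activations move between them."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 4096).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(x.to("cuda:1"))     # move activations to the second GPU

model = TwoGPUModel()
print(model(torch.randn(8, 1024)).device)     # cuda:1

Note that in this naive form the two GPUs work one after the other; pipeline parallelism is needed to keep both busy at the same time.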

FAQ

Can you install a newer PyTorch version?

We're working with the EasyBuild community in preparing and patching new PyTorch versions. You can send a support question in case you want to be kept up to date for when we add a new version.

An alternative is to use the containers at /apps/containers/PyTorch. We happily build later versions of these on request. Note, however, that these have not undergone the same testing and patching procedure as the versions provided as modules.

We typically recommend NGC containers over the plain PyTorch ones. For the meaning of the names see What does NGC mean in the containers?.

We do not recommend installing your own version unless necessary. If you install your own version, it is up to you to make sure that the installation has the capabilities that you need (a few quick checks are sketched after this list):

  • Is the software built for the CUDA compute capabilities corresponding to available GPU types?
  • If you're making use of CPU computations, is the software built for AVX512?
  • If you're doing multi-GPU jobs, is the software built for GPUDirect and InfiniBand?
  • ...
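
A few non-authoritative checks that can help; the exact output depends on how your installation was built:

[cid@vera1 ~]$ python -c "import torch; print(torch.cuda.get_arch_list())"
[cid@vera1 ~]$ python -c "import torch; print(torch.__config__.show())"
[cid@vera1 ~]$ python -c "import torch.distributed as dist; print(dist.is_nccl_available())"

The first prints the CUDA compute capabilities the build supports, the second the build configuration (including CPU capability flags such as AVX512), and the third whether the NCCL backend is available.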

What does NGC mean in the containers?

Containers with NGC in the name are from Nvidia's NGC catalog. Other containers at /apps/containers/PyTorch/ are official PyTorch containers from Docker Hub. The provided containers are not as patched and verified as the provided modules, but should work well for most cases. They even have support for communication over InfiniBand when using NCCL for multi-node communication.
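
As an example, assuming Apptainer is the container runtime on the system (the image file name below is a placeholder; check /apps/containers/PyTorch/ for what is actually available), the --nv flag makes the GPUs visible inside the container:

[cid@vera1 ~]$ apptainer exec --nv /apps/containers/PyTorch/<image>.sif python -c "import torch; print(torch.__version__)"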