Aims of this seminar

Alvis

Technical specifications

  • SNIC resource dedicated to AI/ML research funded by KAW
  • consists of SMP nodes accelerated with multiple GPUs
  • Alvis goes into production in three phases:
    • Phase 1A: equipped with 44 NVIDIA Tesla V100 GPUs (in production)
    • Phase 1B: equipped with 160 NVIDIA Tesla T4 GPUs, 4 NVIDIA Tesla A100 GPUs (in production)
    • Phase 2: inauguration planned for December 2021
  • Node details: https://www.c3se.chalmers.se/about/Alvis/

GPU hardware details

  • 44 x NVIDIA Tesla V100 with 32GB VRAM, compute capability 7.0
    • FP16 (half) 31.33 TFLOPS, FP32 (float) 15.67 TFLOPS, FP64 (double) 7.834 TFLOPS
  • 160 x NVIDIA Tesla T4 with 16GB VRAM, compute capability 7.5
    • FP16 (half) 65.13 TFLOPS, FP32 (float) 8.141 TFLOPS, FP64 (double) 0.254 TFLOPS
  • 4 x NVIDIA Tesla A100 with 40GB VRAM, compute capability 8.0
    • FP16 (half) 77.97 TFLOPS, FP32 (float) 19.49 TFLOPS, FP64 (double) 9.746 TFLOPS
  • Phase 2 (not yet available)
    • 340 x NVIDIA Tesla A40 with 48GB VRAM
    • 224 x NVIDIA Tesla A100 with 40GB VRAM
    • 32 x NVIDIA Tesla A100 with 80GB VRAM
    • New super fast 2 tier storage resource Mimer

The compute cluster

The cluster environment

Connecting

  • Login server: alvis1.c3se.chalmers.se
  • Login nodes are shared resources for all users; don't run heavy jobs here, and don't use up too much memory.
  • The login node is equipped with 4 T4 GPUs and can be used for brief testing, development, debugging, and light pre/post-processing.

Storage

  • See filesystem page for documentation, with examples on how to share files.
  • Your (backed-up) home directory is quite limited; apply for additional storage areas via storage projects in SUPR.
  • The C3SE_quota command shows you all your centre storage areas, with usage and quotas:
Path: /cephyr/users/my_user
  Space used:    17.5GiB       Quota:      30GiB
  Files used:      20559       Quota:      60000

Path: /cephyr/NOBACKUP/groups/my_storage_project
  Space used:  2646.5GiB       Quota:    5000GiB
  Files used:     795546       Quota:    1000000
  • Learn the common tools: cd, pwd, ls, cp, mv, rsync, rmdir, mkdir, rm
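  • For example, a minimal sketch of moving a dataset from your home directory to a storage project with rsync (the dataset name is hypothetical; the project path matches the quota example above):
rsync -avh --progress ~/my_dataset/ /cephyr/NOBACKUP/groups/my_storage_project/my_dataset/
C3SE_quota   # check afterwards that you are still within your quotas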

Datasets

  • Depending on the license type and permissions, a number of popular datasets have been made semi-publicly available through the central storage under /cephyr/NOBACKUP/Datasets/
  • In all cases, use of the datasets is only allowed for non-commercial research applications
  • Note that in certain cases, the provider of the dataset requires you to cite some literature if you use the dataset in your research
  • It is the responsibility of the users to make sure their use of the datasets complies with the above-mentioned permissions and requirements
  • In some cases, further information about the dataset can be found in a README file under the pertinent directory
  • A list of the currently available datasets and supplementary information can be found under datasets
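  • For example, to browse what is currently provided and look for accompanying documentation (the dataset name is a placeholder; README location varies per dataset):
ls /cephyr/NOBACKUP/Datasets/
ls /cephyr/NOBACKUP/Datasets/<dataset_name>/   # look for a README with citation and usage terms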

Software

  • Our systems currently run CentOS 7 (64-bit), which is an open-source version of Red Hat Enterprise Linux
    • CentOS is not Ubuntu!
    • Users do NOT have sudo rights!
    • You can't install software using apt-get!
    • The system installation is intentionally sparse; you access software via modules and containers.
  • Note: the OS will be updated to a CentOS 8 clone when we deploy phase 2.

Software - Containers

  • It is possible to run containers via Singularity
  • It is not possible to run Docker, but you can easily convert Docker containers to Singularity containers
  • We provide some containers under /apps/containers
  • Instructions on how to use the images and some tutorials can be found under advanced topics
  • Singularity also supports overlay filesystems
singularity shell --overlay overlay_1G.img anaconda-2021.05.sif
conda install tensorflow   # run inside the container shell, then exit
singularity exec --overlay overlay_1G.img:ro anaconda-2021.05.sif python catrecognizer.py

Software - Modules

  • A lot of software is available via modules.
  • Commercial software and libraries: MATLAB, CUDA, Nsight Compute, and much more.
  • Tools, compilers, MPI, math libraries, etc.
  • A major update of all software versions is done twice yearly:
    • 2020b: GCC 10.2.0, OpenMPI 4.0.5, CUDA 11.1.1, Python 3.8.6, ...
    • 2021a: GCC 10.3.0, OpenMPI 4.1.1, CUDA 11.3.1, Python 3.9.5, ...
    • 2021b: GCC 11.2.0, OpenMPI 4.1.1, CUDA 11.4.1, Python 3.9.6, ... (WIP)
  • Mixing toolchain versions will not work
  • Popular top level applications such as TensorFlow and PyTorch may be updated within a single toolchain version.

Software - Toolchains

  • Tip: You can test things out on the login node. Try loading and purging modules; changes are temporary.
  • Putting load commands directly in your ~/.bashrc will likely break system utilities like Thinlinc (for you).
  • module load Foo/1.2.3 or ml Foo/1.2.3 for loading
  • module list or ml to list all currently loaded modules
  • module spider Bar or ml spider Bar for searching
  • module purge or ml purge for unloading all modules
  • Modules provide development information as well, so they can be used as dependencies for builds.
  • Both hierarchical and flat module systems are available (exact same software). Probably only the flat modules will be available in the future. Switch with hierarchical_modules and flat_modules. See the module page for details.
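  • A minimal workflow sketch (GCC and CUDA versions are from the 2021a list above; the Python module name is illustrative, so check module spider for the real names):
module purge                                  # start clean so toolchain generations are not mixed
module load GCC/10.3.0 CUDA/11.3.1            # versions from the 2021a generation above
module load Python/3.9.5-GCCcore-10.3.0       # illustrative flat module name
module list                                   # verify what ended up loaded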

Software installation

  • We build a lot of modules and containers for general use upon request.
  • We provide pip, singularity, conda, and virtualenv so you can install your own Python packages locally.
    • See Python information on our homepage for examples.
    • Be aware of your quota! Consider making a container if you need environments; a minimal virtualenv sketch follows this list.
  • You can use modules for linking to software you build yourself.
  • Trying to mix conda, virtualenv, containers, and modules will not work well.
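  • A minimal virtualenv sketch (the Python module name is illustrative; pick one from the toolchain generation you use, and note the installed packages count towards your quota):
module purge
module load Python/3.9.5-GCCcore-10.3.0       # use a module-provided Python, not the OS one
python -m venv ~/my_venv                      # environment lives in your home directory
source ~/my_venv/bin/activate
pip install --upgrade pip h5py                # install whatever packages you need
deactivate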

Software - Installing binary (pre-compiled) software

  • A common problem is that software requires a newer glibc version; glibc is tied to the OS and can't be upgraded.
    • You can use a Singularity container to work around this for your software.
  • Make sure to use binaries that are compiled and optimised for the hardware.
    • Alvis CPUs support up to AVX512.
    • The difference can be huge. Example: compared to our optimised NumPy builds, the generic x86 version from pip is ~3x slower on Hebbe, and ~9x slower on Vera.
  • CPU: AVX512 > AVX2 > AVX > SSE > Generic instructions.
  • GPU: Make sure you use the right CUDA compute capabilities for the GPU you choose.
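  • A quick sketch for checking what a node supports, using standard Linux/CUDA tools (nothing Alvis-specific):
grep -m1 -woE 'avx512f|avx2|avx|sse4_2' /proc/cpuinfo | sort -u   # CPU instruction sets on this node
nvidia-smi -L   # GPUs visible to your job; match their compute capability (listed above) when building CUDA code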

Running jobs on Alvis

  • Alvis is dedicated to AI/ML research which typically involves GPU-hungry computations; therefore, your job must allocate at least one GPU
  • You only allocate GPUs (cores and memory are assigned automatically)
  • Hyperthreading is disabled on Alvis
  • Alvis comes in phases (1A, 1B, and 2), and the nodes vary in terms of:
    • number of cores
    • number and type of GPUs
    • memory per node
    • CPU architecture

SLURM

  • Alvis runs the SLURM workload manager, a batch queuing software
  • Allocations and usage are defined at the (SNIC) project level.
  • Fairshare system: you can go over your monthly allocation, but your rolling 30-day ("monthly") usage average affects your queue priority.
  • For more details see Running jobs.

Job command overview

  • sbatch: submit batch jobs
  • srun: submit interactive jobs
  • jobinfo (squeue): view the job-queue and the state of jobs in queue, shows amount of idling resources
  • scontrol show job <jobid>: show details about a job, including reasons why it's pending
  • sprio: show all your pending jobs and their priority
  • scancel: cancel a running or pending job
  • sinfo: show status for the partitions (queues): how many nodes are free, how many are down, busy, etc.
  • sacct: show scheduling information about past jobs
  • projinfo: show the projects you belong to, including monthly allocation and usage
  • For details, refer to the -h flag, man pages, or google!

Allocating GPUs on Alvis

  • Specify the type of GPUs you want and the number of them per node, e.g:
    • #SBATCH --gpus-per-node=V100:2
    • #SBATCH --gpus-per-node=T4:3
    • #SBATCH --gpus-per-node=A100:1
  • If you need more memory, use the constraint flag -C to pick the nodes with more RAM:
    • #SBATCH --gpus-per-node=V100:2 -C 2xV100 (only 2 V100s on these nodes, thus twice the RAM per GPU)
    • #SBATCH --gpus-per-node=T4:1 -C MEM1536
  • Many more expert options:
    • #SBATCH --gpus-per-node=T4:8 -N 2 --cpus-per-task=32
    • #SBATCH -N 2 --gres=ptmpdir:1
    • #SBATCH --gres=gpuexlc:1,mps:1
  • Mixing GPUs of different types is not possible

GPU cost on Alvis

Type       VRAM   System memory per GPU   CPU cores per GPU   Cost
T4         16GB   72 or 192 GB            4                   0.35
A40*       48GB                                               1
V100       32GB   96 or 192 GB            8                   1.31
A100*      40GB                                               1.84
A100fat*   80GB                                               2.2
  • Example: using 2xT4 GPUs for 10 hours costs 7 "GPU hours" (2 x 0.35 x 10).
  • The cost reflects the actual price of the hardware (normalised against an A40 node/GPU).
  • * available in the near future. Currently only 4 A100 GPUs are available from phase 1.

Querying visible devices

  • Control groups (an OS feature) are used automatically to limit your session to the GPUs you request.
  • Using $CUDA_VISIBLE_DEVICES you can verify that your application has correctly picked up the hardware:
srun -A YOUR_ACCOUNT -t 00:02:00 --gpus-per-node=V100:2 --pty bash
srun: job 22441 queued and waiting for resources
srun: job 22441 has been allocated resources
$ echo ${CUDA_VISIBLE_DEVICES}
0,1
  • Most software tends to "just work"
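  • You can also check non-interactively that only the allocated GPUs are visible, e.g.:
srun -A YOUR_ACCOUNT -t 00:02:00 --gpus-per-node=T4:1 nvidia-smi -L
# expected output: a single line similar to "GPU 0: Tesla T4 (UUID: ...)"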

Long running jobs

  • We only allow a maximum walltime of 7 days.
  • Anything long running should use checkpointing of some sort to save partial results (see the sketch after this list).
  • You will not be compensated for simulations aborted due to hardware or software errors.
  • With phase 2, the plan is to introduce a partition which only allows for short jobs for interactive use.
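  • A checkpoint-aware sketch of a job script (train.py and its flags are hypothetical; most frameworks provide their own checkpoint save/load):
#!/usr/bin/env bash
#SBATCH -A SNIC2020-Y-X -p alvis
#SBATCH -t 7-00:00:00
#SBATCH --gpus-per-node=V100:1

CKPT_DIR=/cephyr/NOBACKUP/groups/my_storage_project/checkpoints
if ls "$CKPT_DIR"/*.ckpt >/dev/null 2>&1; then
    # resume from the most recent checkpoint if one exists
    python train.py --checkpoint-dir "$CKPT_DIR" --resume "$(ls -t "$CKPT_DIR"/*.ckpt | head -n1)"
else
    python train.py --checkpoint-dir "$CKPT_DIR"
fi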

Multi-node jobs

  • For multi-node jobs your application will need to handle all the inter-node communication, typically done with MPI.
  • You may need to port your problem to a framework that supports distributed learning, e.g. Horovod
  • If you can run multiple separate jobs with fewer GPUs each, this is preferable for system utilisation.
  • You will only be able to allocate full nodes when requesting more than one.

Example: Working with many small files

#!/usr/bin/env bash
#SBATCH -A SNIC2020-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=V100:1

unzip many_tiny_files_dataset.zip -d $TMPDIR/
singularity exec --nv ~/tensorflow-2.1.0.sif python trainer.py --training_input=$TMPDIR/
# or use available containers e.g.
# /apps/containers/TensorFlow/TensorFlow_v2.3.1-tf2-py3-GPU-Jupyter.sif
  • Prefer to write code that uses HDF5, netCDF, tar directly. h5py is very easy to use.

Example: Job arrays

#!/usr/bin/env bash
#SBATCH -A SNIC2020-Y-X -p alvis
#SBATCH -t 5:00:00
#SBATCH --gpus-per-node=T4:2
#SBATCH --array=0-9
#SBATCH --mail-user=zapp.brannigan@chalmers.se --mail-type=end

module load fosscuda/2019b PyTorch/1.3.1-Python-3.7.4 HDF5
python classification_problem.py dataset_$SLURM_ARRAY_TASK_ID.hdf5
  • Environment variables like $SLURM_ARRAY_TASK_ID can also be accessed from within all programming languages, e.g:
array_id = getenv('SLURM_ARRAY_TASK_ID'); % matlab
array_id = os.getenv('SLURM_ARRAY_TASK_ID') # python

Example: Multi-node

#!/usr/bin/env bash
#SBATCH -A SNIC2020-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=T4:8
## 2 tasks across 2 nodes
#SBATCH --nodes 2 --ntasks 2

module load fosscuda/2019b Horovod/0.19.1-TensorFlow-2.1.0-Python-3.7.4

mpirun python horovod_keras_tf2_example.py
  • Multi-node jobs start on the first node which should then launch the rest (with mpirun/srun).
  • Make sure you are using the resources you request!
  • If using a container, you need to load a matching MPI from the module system

Interactive use

  • Alvis is a batch queue system, so you should expect to queue sometimes. The bulk of your computations should run as queued batch jobs.
  • Use jobinfo to pick whatever GPU type happens to be idle.
  • The login node allows for light interactive use; it has 4 T4 GPUs, but they are all shared.
    • Use nvidia-smi to check current usage and select your GPU number with export CUDA_VISIBLE_DEVICES=X.
  • The login node needs to be restarted occasionally; do not make your production runs rely on the login node's uptime!
  • Run interactively on compute nodes with srun, e.g.
    • srun -A SNIC2020-X-Y -p alvis --gpus-per-node=T4:1 --pty bash

Jupyter Notebooks

  • Jupyter Notebooks can run on the login node or on compute nodes, e.g.:
srun -A SNIC2020-X-Y -p alvis -t 8:00:00 --gpus-per-node=V100:1 --pty \
    jupyter lab
srun -A SNIC2020-X-Y -p alvis -t 4:00:00 --gpus-per-node=T4:1 --pty \
    singularity exec --nv my_container.sif jupyter notebook
  • Start long running notebooks with sbatch to decouple them from the login node.
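  • A minimal sbatch sketch (the container is one of those under /apps/containers; the URL/token to connect to ends up in the output file, and you connect via an SSH tunnel or the portal):
#!/usr/bin/env bash
#SBATCH -A SNIC2020-X-Y -p alvis
#SBATCH -t 8:00:00
#SBATCH --gpus-per-node=T4:1
#SBATCH -J jupyter -o jupyter-%j.out

singularity exec --nv /apps/containers/TensorFlow/TensorFlow_v2.3.1-tf2-py3-GPU-Jupyter.sif \
    jupyter notebook --no-browser --ip=0.0.0.0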

Portal (under heavy development)

  • Open OnDemand portal https://portal.c3se.chalmers.se
  • Possible to log in right now, but under heavy development!
  • Can be used to launch notebooks and desktops on nodes.
  • We are looking to add several more features: viewing storage projects, improved job templates, and additional applications.

Job and queue monitoring

  • jobinfo shows you the queue and available GPUs
    • You can ssh into nodes while your jobs are running and, for example, run nvidia-smi or htop (see the sketch after this list).
  • job_stats.py JOBID gives you a URL to a public Grafana page showing your job's usage.
  • dcgmi reports are available after jobs finish
  • sinfo -Rl command shows reason if nodes are down (typically for maintenance)
  • Alvis Grafana page shows state of login node and queue.
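  • For example, to inspect a running job by hand (the node name is a placeholder; take it from the squeue/jobinfo output):
squeue -u $USER -o "%i %N"    # job ID and the node(s) it runs on
ssh alvisX-Y                  # hop onto one of your job's nodes
nvidia-smi                    # GPU utilisation and memory
htop                          # CPU and memory usage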

Tensorboard

  • We have a Tensorboard guide
    • Add a Tensorboard callback to generate logs in a job-specific directory (overlapping logs confuse Tensorboard!)
    • Connect via an SSH tunnel (preferable), Thinlinc, or a proxy server (see the sketch after this list).
    • Tip: the SSH tunnel can also be used for running other services on nodes, like code-server.
  • Be aware of security, because Tensorboard offers none!
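  • A sketch of the SSH-tunnel approach (port, log directory, and node name are placeholders; see the guide for the exact steps):
tensorboard --logdir "$LOGDIR" --port 6006 &                        # on the compute node, inside your job
ssh -L 6006:alvisX-Y:6006 your_username@alvis1.c3se.chalmers.se     # on your own machine; alvisX-Y is the node running the job
# then browse to http://localhost:6006 -- remember Tensorboard itself has no authentication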

Things to keep in mind

  • Never run (big or long) jobs on the login node! Otherwise, the misbehaving processes will be killed by the administrators
    • If this is done repeatedly, you will be logged out, and your account will temporarily be blocked
  • You can, however, use the login node interactively for:
    • Preparing your job and checking if everything's OK before submitting the job
    • Debugging a lightweight job and running tests
  • You are expected to keep an eye on how your job performs, especially for new job scripts/codes!
    • Linux command line tools available on the login node and on the allocated nodes can help you check CPU, memory and network usage

Getting support

  • We provide support to our users, but not for any and all problems
    • We can help you with software installation issues, and recommend compiler flags etc. for optimal performance
    • We can install software system-wide if there are many users who need it - but probably not for one user (unless the installation is simple)
    • We don't support your application software, help debug your code/model, or prepare your input files.
    • Book a time to meet us during office hours for help with things that are hard to put into a support request email

Error reports

  • In order to help you, we need as much good information as possible:

    • What's the job-ID of the failing job?
    • What working directory and what job-script?
    • What software are you using?
    • What's happening - especially error messages?
    • Did this work before, or has it never worked?
    • Do you have a minimal example?
    • No need to attach files; just point us to a directory on the system.
  • Support cases must go through https://supr.snic.se/support

Example jobs

  • We try to provide a repository of example jobs
    • https://github.com/c3se/alvis-intro
    • The examples can be used to get up and running with your first job quickly.
    • They showcase useful tips and tricks that even experienced users can learn from.