Chalmers e-Commons/C3SE
2025-01-15
Technical specifications
# of GPUs | GPU type | Compute capability | CPU | Note |
---|---|---|---|---|
44 | V100 | 7.0 | Skylake | |
160 | T4 | 7.5 | Skylake | |
332 | A40 | 8.6 | Icelake | No IB |
296 | A100 | 8.0 | Icelake | Fast Mimer |
32 | A100fat | 8.0 | Icelake | Fast Mimer |
- `alvis1.c3se.chalmers.se` has 4 T4 GPUs for light testing and debugging
- `alvis2.c3se.chalmers.se` is a dedicated data transfer node
- Log in with `ssh <CID>@alvis1.c3se.chalmers.se` or `ssh <CID>@alvis2.c3se.chalmers.se`
- Web access: https://alvis1.c3se.chalmers.se:300/ or https://alvis2.c3se.chalmers.se:300/
- Cephyr `/cephyr/` and Mimer `/mimer/` are parallel filesystems, accessible from all nodes
- Home directory: `/cephyr/users/<CID>/Alvis` (alt. use `~`)
- Storage project area: `/mimer/NOBACKUP/groups/<storage-name>`
- `C3SE_quota` shows you all your centre storage areas, usage and quotas (see also `where-are-my-files`)
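A quick orientation on the command line might look like this (a minimal sketch; `<CID>` and `<storage-name>` are placeholders for your own account and storage project):

```bash
pwd                                        # print the current working directory
cd /mimer/NOBACKUP/groups/<storage-name>   # change to your storage project area on Mimer
C3SE_quota                                 # list your storage areas, usage and quotas
```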
- File transfer: `scp`, `rsync`, `rclone` (see the sketch after this list)
- Basic shell commands: `cd`, `pwd`, `ls`, `cp`, `mv`, `rsync`, `rmdir`, `mkdir`, `rm`
- Text editors: `nano`, `vim`, `emacs`, …
- Common datasets are available under `/mimer/NOBACKUP/Datasets/`
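For example, a dataset could be copied from your local machine to project storage through the data transfer node (a minimal sketch; `my_dataset` is a hypothetical directory name):

```bash
# Run on your local machine: copy a dataset to Mimer via the data transfer node alvis2
rsync -avh --progress ./my_dataset/ \
    <CID>@alvis2.c3se.chalmers.se:/mimer/NOBACKUP/groups/<storage-name>/my_dataset/
```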
- You do not have `sudo` rights!
- No `apt-get`!
- Prebuilt containers are available under `/apps/containers`
- Build your own containers with `apptainer build my_container.sif my_recipe.def` (see the sketch after this list)
- Modifying `~/.bashrc` will likely break system utilities like Thinlinc (for you).
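Building and running a container could look like this (a minimal sketch; the image and recipe names follow the command above):

```bash
# Build the image from a definition file (done once), then run it with GPU support
apptainer build my_container.sif my_recipe.def
apptainer exec --nv my_container.sif python --version   # --nv passes the NVIDIA GPUs through
```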
- `module load Foo/1.2.3` or `ml Foo/1.2.3` for loading
- `module list` or `ml` to list all currently loaded modules
- `module spider Bar` or `ml spider Bar` for searching
- `module keyword Bar` or `ml keyword Bar` for searching keywords (e.g. extensions in Python bundles)
- `module purge` or `ml purge` for unloading all modules
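A typical module workflow, using a module version that appears in the job-script examples later in this document:

```bash
ml spider PyTorch                           # list available PyTorch versions
ml PyTorch/2.1.2-foss-2023a-CUDA-12.1.1     # load a specific version
ml                                          # show what is currently loaded
ml purge                                    # start from a clean slate again
```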
- `pip`, `apptainer`, `conda`, and `virtualenv` are available so you can install your own Python packages locally (a sketch follows this list).
- Avoid `pip install --user`: it is likely to make a mess when used with any other approach and fill up your home directory quota quickly.
- `pip` can be up to ~9x slower on Vera.
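One way to set this up is a virtual environment on top of a module-provided Python stack (a minimal sketch; the module name is taken from the examples below, and `my_venv`/`some_package` are placeholders):

```bash
ml purge
ml TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1
virtualenv --system-site-packages my_venv   # consider placing it in your storage project area
source my_venv/bin/activate
pip install some_package                    # installs into my_venv/
```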
Tools are available for monitoring your jobs (`job_stats.py`, `nvidia-smi`, …).

- `sbatch`: submit batch jobs
- `srun`: submit interactive jobs
- `jobinfo` (`squeue`): view the job-queue and the state of jobs in queue, shows the amount of idling resources
- `scontrol show job <jobid>`: show details about a job, including reasons why it is pending
- `sprio`: show all your pending jobs and their priority
- `scancel`: cancel a running or pending job
- `sinfo`: show status for the partitions (queues): how many nodes are free, how many are down, busy, etc.
- `sacct`: show scheduling information about past jobs
- `projinfo`: show the projects you belong to, including monthly allocation and usage
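A hypothetical job-management session using these commands (the job id and script name are made up):

```bash
sbatch my_job.sh             # submit a batch script; prints e.g. "Submitted batch job 22441"
jobinfo                      # is it running or still queued?
scontrol show job 22441      # details, including the reason if it is pending
scancel 22441                # cancel it if something is wrong
sacct -j 22441               # accounting information once it has finished
projinfo                     # how much of the monthly allocation is left
```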
- `#SBATCH --gpus-per-node=V100:2`
- `#SBATCH --gpus-per-node=T4:3`
- `#SBATCH --gpus-per-node=A100:1`
- Use `-C` to pick the nodes with more RAM:
  - `#SBATCH --gpus-per-node=V100:2 -C 2xV100` (only 2 V100 on these nodes, thus twice the RAM per GPU)
  - `#SBATCH --gpus-per-node=T4:1 -C MEM1536`
- `#SBATCH -C NOGPU`
- `#SBATCH --gpus-per-node=T4:8 -N 2 --cpus-per-task=32`
- `#SBATCH -N 2 --gres=ptmpdir:1`
- `#SBATCH --gres=gpuexlc:1,mps:1`
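Put together, a minimal batch script requesting one A40 GPU might look like this (a sketch; the project id, module, and script name are placeholders based on the examples in this document):

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 02:00:00
#SBATCH --gpus-per-node=A40:1

ml purge
ml TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1
python my_training_script.py    # hypothetical training script
```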
Type | VRAM | System memory per GPU | CPU cores per GPU | Cost |
---|---|---|---|---|
T4 | 16 GB | 72 or 192 GB | 4 | 0.35 |
A40 | 48 GB | 64 GB | 16 | 1 |
V100 | 32 GB | 192 or 384 GB | 8 | 1.31 |
A100 | 40 GB | 64 or 128 GB | 16 | 1.84 |
A100fat | 80 GB | 256 GB | 16 | 2.2 |
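If the Cost column is read as a relative charge per GPU-hour (an assumption on my part, not stated above), a rough estimate for a job could be computed like this:

```bash
# Hypothetical estimate: 2 A100 GPUs for 10 hours at cost factor 1.84
python -c "print(2 * 10 * 1.84)"   # 36.8 units charged to the project allocation
```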
Data type | A100 | A40 | V100 | T4 |
---|---|---|---|---|
FP64 | 9.7 / 19.5* | 0.58 | 7.8 | 0.25 |
FP32 | 19.5 | 37.4 | 15.7 | 8.1 |
TF32 | 156** | 74.8** | N/A | N/A |
FP16 | 312** | 149.7** | 125 | 65 |
BF16 | 312** | 149.7** | N/A | N/A |
Int8 | 624** | 299.3** | 64 | 130 |
Int4 | 1248** | 598.7** | N/A | 260 |
With `$CUDA_VISIBLE_DEVICES` you can make sure that your application has correctly picked up the hardware:

```
$ srun -A YOUR_ACCOUNT -t 00:02:00 --gpus-per-node=V100:2 --pty bash
srun: job 22441 queued and waiting for resources
srun: job 22441 has been allocated resources
$ echo ${CUDA_VISIBLE_DEVICES}
0,1
```
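From inside the interactive session you can also verify the allocation at framework level (a sketch; the PyTorch module name is taken from the job-script examples below):

```bash
ml PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
python -c "import torch; print(torch.cuda.device_count())"   # expect 2 for --gpus-per-node=V100:2
```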
An example batch script that unpacks a dataset of many small files to the node-local `$TMPDIR` and trains from there:

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=V100:1

unzip many_tiny_files_dataset.zip -d $TMPDIR/
apptainer exec --nv ~/tensorflow-2.1.0.sif python trainer.py --training_input=$TMPDIR/
# or use available containers e.g.
# /apps/containers/TensorFlow/TensorFlow_v2.3.1-tf2-py3-GPU-Jupyter.sif
```
`h5py` is very easy to use.
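A minimal sketch of packing data into a single HDF5 file with `h5py` (the file and dataset names are hypothetical, matching the naming used in the array-job example below):

```bash
ml h5py/3.9.0-foss-2023a    # module name taken from the array-job example below
python - <<'EOF'
# Pack many small samples into one HDF5 file instead of many tiny files
import h5py
import numpy as np

with h5py.File("dataset_0.hdf5", "w") as f:
    f.create_dataset("images", data=np.zeros((100, 28, 28), dtype="float32"))
    f.create_dataset("labels", data=np.zeros(100, dtype="int64"))
EOF
```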
A batch script that runs a Jupyter notebook non-interactively with `ipython`:

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2024-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=A40:1

module purge
module load TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1
module load IPython/8.5.0-GCCcore-11.3.0

ipython -c "%run my-notebook.ipynb"
```
A job array running the same script over ten datasets:

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 5:00:00
#SBATCH --gpus-per-node=T4:2
#SBATCH --array=0-9
#SBATCH --mail-user=zapp.brannigan@chalmers.se --mail-type=end

module load PyTorch/2.1.2-foss-2023a-CUDA-12.1.1 h5py/3.9.0-foss-2023a
python classification_problem.py dataset_$SLURM_ARRAY_TASK_ID.hdf5
```

`$SLURM_ARRAY_TASK_ID` can also be accessed from within all programming languages, e.g. via the environment.
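For instance, from Python (a minimal sketch run as a shell one-liner):

```bash
# Read the array index from the environment inside Python
python -c 'import os; print(os.environ["SLURM_ARRAY_TASK_ID"])'
```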
A multi-node example using Horovod:

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=T4:8
## 2 tasks across 2 nodes
#SBATCH --nodes 2 --ntasks 2

module load Horovod/0.28.1-foss-2022a-CUDA-11.7.0-TensorFlow-2.11.0
mpirun python horovod_keras_tf2_example.py
```
- Distributed processes are started with `mpirun` (or `srun`).
- Use `jobinfo` or check the portal footer to find idle GPUs.
- Use `nvidia-smi` to check current usage and select your GPU number with `export CUDA_VISIBLE_DEVICES=X`.
- Use `srun`, e.g.

```
srun -A NAISS2023-X-Y -p alvis --gpus-per-node=T4:1 --pty bash
ipython -c "%run name-of-notebook-here.ipynb"
```
- `jobinfo` shows you the queue and available GPUs
- `projinfo` shows your project's allocation and how much of it has been used
- The `sinfo -Rl` command shows the reason if nodes are down (typically for maintenance)
- `scontrol show reservation` shows reservations (e.g. planned maintenance)
- `job_stats.py JOBID` gives you a URL to a public Grafana page with your job's usage.
- You can `ssh` into nodes while your jobs are running and, for example, run `nvidia-smi` or `htop`.
- `torch.profiler` (possibly with TensorBoard; see the sketch after this list)
- `code-server`
- `C3SE_quota`
- `job_stats.py`: did you over-allocate memory until your program was killed?
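The `torch.profiler` suggestion above might be used like this (a minimal sketch with a hypothetical model and input; run inside a job with a GPU and a PyTorch module loaded):

```bash
python - <<'EOF'
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Profile a few forward passes on CPU and GPU
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)

# Show the operations that used the most GPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
EOF
```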