Chalmers e-Commons/C3SE
2025-01-15
Technical specifications
| #GPUs | GPUs | Capability | CPU | Note |
|---|---|---|---|---|
| 44 | V100 | 7.0 | Skylake | |
| 160 | T4 | 7.5 | Skylake | |
| 332 | A40 | 8.6 | Icelake | No IB |
| 296 | A100 | 8.0 | Icelake | Fast Mimer |
| 32 | A100fat | 8.0 | Icelake | Fast Mimer |


- alvis1.c3se.chalmers.se has 4 T4 GPUs for light testing and debugging
- alvis2.c3se.chalmers.se is a dedicated data transfer node
- ssh <CID>@alvis1.c3se.chalmers.se, ssh <CID>@alvis2.c3se.chalmers.se
- Cephyr /cephyr/ and Mimer /mimer/ are parallel filesystems, accessible from all nodes.
- /cephyr/users/<CID>/Alvis (alt. use ~)
- /mimer/NOBACKUP/groups/<storage-name>
- C3SE_quota shows you all your centre storage areas, usage and quotas.
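A minimal sketch of checking your storage areas from a login node (the <CID> and <storage-name> placeholders are the same as above):

```bash
# Show usage and quotas for all your centre storage areas
C3SE_quota

# Personal centre storage (same location as ~)
ls /cephyr/users/<CID>/Alvis

# Project storage on Mimer
ls /mimer/NOBACKUP/groups/<storage-name>
```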
- where-are-my-files
- scp, rsync, rclone
- cd, pwd, ls, cp, mv, rsync, rmdir, mkdir, rm
- nano, vim, emacs, …
- /mimer/NOBACKUP/Datasets/
- No sudo rights! No apt-get!
- /apps/containers
- apptainer build my_container.sif my_recipe.def
- Modifying ~/.bashrc will likely break some system utilities for you.
- module load Foo/1.2.3 or ml Foo/1.2.3 for loading
- module list or ml to list all currently loaded modules
- module spider Bar or ml spider Bar for searching
- module keyword Bar or ml keyword Bar for searching keywords (e.g. extensions in Python bundles)
- module purge or ml purge for unloading all modules
- pip, apptainer, conda, and virtualenv so you can install your own Python packages locally.
- pip install --user is likely to make a mess when used with any other approach and will quickly fill up your home directory quota.
- pip is up to ~9x slower on Vera.
- job_stats.py, nvidia-smi, …
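One possible pattern for installing your own Python packages, sketched on top of a module that appears later in this section, as an alternative to pip install --user (the environment path and package name are placeholders):

```bash
# Load a toolchain module first, then keep extra packages in your own virtual environment
ml purge
ml PyTorch/2.1.2-foss-2023a-CUDA-12.1.1

# The environment can still see the packages provided by the module
python -m venv --system-site-packages ~/my_env   # ~/my_env is an example path
source ~/my_env/bin/activate
pip install --no-cache-dir <some-package>        # placeholder package name; installs into ~/my_env, not ~/.local
```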


- sbatch: submit batch jobs
- srun: submit interactive jobs
- jobinfo (squeue): view the job queue and the state of jobs in the queue, shows the amount of idling resources
- scontrol show job <jobid>: show details about a job, including reasons why it is pending
- sprio: show all your pending jobs and their priority
- scancel: cancel a running or pending job
- sinfo: show status for the partitions (queues): how many nodes are free, how many are down, busy, etc.
- sacct: show scheduling information about past jobs
- projinfo: show the projects you belong to, including monthly allocation and usage

(A short submit-and-monitor sketch follows after the hardware tables below.)

GPUs are requested per node, by type and count:

- #SBATCH --gpus-per-node=V100:2
- #SBATCH --gpus-per-node=T4:3
- #SBATCH --gpus-per-node=A100:1

Use -C to pick the nodes with more RAM:

- #SBATCH --gpus-per-node=V100:2 -C 2xV100 (only 2 V100 on these nodes, thus twice the RAM per GPU)
- #SBATCH --gpus-per-node=T4:1 -C MEM1536

Other examples:

- #SBATCH -C NOGPU
- #SBATCH --gpus-per-node=T4:8 -N 2 --cpus-per-task=32
- #SBATCH -N 2 --gres=ptmpdir:1
- #SBATCH --gres=gpuexlc:1,mps:1

| Type | VRAM | System memory per GPU | CPU cores per GPU | Cost |
|---|---|---|---|---|
| T4 | 16GB | 72 or 192 GB | 4 | 0.35 |
| A40 | 48GB | 64 GB | 16 | 1 |
| V100 | 32GB | 192 or 384 GB | 8 | 1.31 |
| A100 | 40GB | 64 or 128 GB | 16 | 1.84 |
| A100fat | 80GB | 256 GB | 16 | 2.2 |
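Assuming the Cost column is the charge factor per GPU-hour against your project allocation (an interpretation, not stated explicitly above), a 10-hour job on 2 T4 GPUs would be charged 2 × 0.35 × 10 = 7 units, while the same job on 2 A100 GPUs would be charged 2 × 1.84 × 10 = 36.8 units.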
| Data type | A100 | A40 | V100 | T4 |
|---|---|---|---|---|
| FP64 | 9.7 / 19.5* | 0.58 | 7.8 | 0.25 |
| FP32 | 19.5 | 37.4 | 15.7 | 8.1 |
| TF32 | 156** | 74.8** | N/A | N/A |
| FP16 | 312** | 149.7** | 125 | 65 |
| BF16 | 312** | 149.7** | N/A | N/A |
| Int8 | 624** | 299.3** | 64 | 130 |
| Int4 | 1248** | 598.7** | N/A | 260 |
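As referenced above, a minimal sketch of the typical submit-and-monitor loop (the project ID and job script name are placeholders):

```bash
# Submit a job script (the resource requests can also live inside the script as #SBATCH lines)
sbatch -A NAISS2023-Y-X -p alvis -t 01:00:00 --gpus-per-node=T4:1 jobscript.sh

jobinfo                    # view the queue and the amount of idle resources
scontrol show job <jobid>  # details, including why a job is still pending
projinfo                   # monthly allocation and usage for your projects
scancel <jobid>            # cancel a running or pending job if needed
```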
```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=V100:1
unzip many_tiny_files_dataset.zip -d $TMPDIR/
apptainer exec --nv ~/tensorflow-2.1.0.sif python trainer.py --training_input=$TMPDIR/
# or use available containers e.g.
# /apps/containers/TensorFlow/TensorFlow_v2.3.1-tf2-py3-GPU-Jupyter.sif
```

For storing datasets in fewer, larger files (e.g. HDF5 instead of many tiny files), h5py is very easy to use.

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2024-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=A40:1
module purge
module load TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1
module load IPython/8.5.0-GCCcore-11.3.0
ipython -c "%run my-notebook.ipynb"#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 5:00:00
#SBATCH --gpus-per-node=T4:2
#SBATCH --array=0-9
#SBATCH --mail-user=zapp.brannigan@chalmers.se --mail-type=end
module load PyTorch/2.1.2-foss-2023a-CUDA-12.1.1 h5py/3.9.0-foss-2023a
python classification_problem.py dataset_$SLURM_ARRAY_TASK_ID.hdf5
```

$SLURM_ARRAY_TASK_ID can also be accessed from within all programming languages, e.g.:
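A minimal sketch: the index is an ordinary environment variable inside the job, so any language can read it, e.g. Python via os.environ:

```bash
# Read the array index from inside a program (here: a one-line Python example)
python -c 'import os; print("array task id:", os.environ["SLURM_ARRAY_TASK_ID"])'
```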
```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=T4:8
## 2 tasks across 2 nodes
#SBATCH --nodes 2 --ntasks 2
module load Horovod/0.28.1-foss-2022a-CUDA-11.7.0-TensorFlow-2.11.0
mpirun python horovod_keras_tf2_example.py
```

- Distributed jobs are launched with mpirun/srun.
- Use jobinfo or check the portal footer to find idle GPUs.
- Use nvidia-smi to check current usage and select your GPU number with export CUDA_VISIBLE_DEVICES=X.
- Use srun, e.g.:
- srun -A NAISS2023-X-Y -p alvis --gpus-per-node=T4:1 --pty bash
- ipython -c "%run name-of-notebook-here.ipynb"
- jobinfo shows you the queue and available GPUs
- Check your project's remaining allocation and usage (projinfo).
- The sinfo -Rl command shows the reason if nodes are down (typically for maintenance).
- scontrol show reservation shows reservations (e.g. planned maintenance).
- job_stats.py JOBID gives you a URL to a public Grafana page for your job usage.
- You can ssh into nodes while your jobs are running, and for example run nvidia-smi or htop.
- torch.profiler (possibly with TensorBoard)
- code-server
- C3SE_quota
- job_stats.py: Did you over-allocate memory until your program was killed?
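A minimal sketch of checking on a running job (JOBID and the node name are placeholders; the node name can be taken from jobinfo/squeue output):

```bash
# Public Grafana dashboard with the job's CPU/GPU/memory usage over time
job_stats.py JOBID

# You can ssh to a node where your job is currently running
ssh <node-name>
nvidia-smi   # GPU utilisation and memory on that node
htop         # CPU and memory usage per process
```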