Chalmers e-Commons/C3SE
2025-01-15
Technical specifications
# of GPUs | GPU type | Compute capability | CPU | Note |
---|---|---|---|---|
44 | V100 | 7.0 | Skylake | |
160 | T4 | 7.5 | Skylake | |
332 | A40 | 8.6 | Icelake | No IB |
296 | A100 | 8.0 | Icelake | Fast Mimer |
32 | A100fat | 8.0 | Icelake | Fast Mimer |
- `alvis1.c3se.chalmers.se` has 4 T4 GPUs for light testing and debugging
- `alvis2.c3se.chalmers.se` is a dedicated data transfer node
- Log in with `ssh <CID>@alvis1.c3se.chalmers.se` or `ssh <CID>@alvis2.c3se.chalmers.se`
- Web access: https://alvis1.c3se.chalmers.se:300/ or https://alvis2.c3se.chalmers.se:300/
- Cephyr `/cephyr/` and Mimer `/mimer/` are parallel filesystems, accessible from all nodes
- Home directory: `/cephyr/users/<CID>/Alvis` (alt. use `~`)
- Storage project area: `/mimer/NOBACKUP/groups/<storage-name>`
- `C3SE_quota` shows you all your centre storage areas, usage and quotas (see also `where-are-my-files`)
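A quick orientation on the command line might look like this (a minimal sketch; `<CID>` and `<storage-name>` are placeholders for your own account and storage project):

```bash
pwd                                        # print the current working directory
cd /mimer/NOBACKUP/groups/<storage-name>   # change to your storage project area on Mimer
C3SE_quota                                 # list your storage areas, usage and quotas
```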
- File transfer: `scp`, `rsync`, `rclone` (see the sketch after this list)
- Basic shell commands: `cd`, `pwd`, `ls`, `cp`, `mv`, `rsync`, `rmdir`, `mkdir`, `rm`
- Text editors: `nano`, `vim`, `emacs`, …
- Common datasets are available under `/mimer/NOBACKUP/Datasets/`
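For example, a dataset could be copied from your local machine to project storage through the data transfer node (a minimal sketch; `my_dataset` is a hypothetical directory name):

```bash
# Run on your local machine: copy a dataset to Mimer via the data transfer node alvis2
rsync -avh --progress ./my_dataset/ \
    <CID>@alvis2.c3se.chalmers.se:/mimer/NOBACKUP/groups/<storage-name>/my_dataset/
```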
- You do not have `sudo` rights!
- No `apt-get`!
- Prebuilt containers are available under `/apps/containers`
- Build your own containers with `apptainer build my_container.sif my_recipe.def` (see the sketch after this list)
- Modifying `~/.bashrc` will likely break system utilities like Thinlinc (for you).
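Building and running a container could look like this (a minimal sketch; the image and recipe names follow the command above):

```bash
# Build the image from a definition file (done once), then run it with GPU support
apptainer build my_container.sif my_recipe.def
apptainer exec --nv my_container.sif python --version   # --nv passes the NVIDIA GPUs through
```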
- `module load Foo/1.2.3` or `ml Foo/1.2.3` for loading
- `module list` or `ml` to list all currently loaded modules
- `module spider Bar` or `ml spider Bar` for searching
- `module keyword Bar` or `ml keyword Bar` for searching keywords (e.g. extensions in Python bundles)
- `module purge` or `ml purge` for unloading all modules
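A typical module workflow, using a module version that appears in the job-script examples later in this document:

```bash
ml spider PyTorch                           # list available PyTorch versions
ml PyTorch/2.1.2-foss-2023a-CUDA-12.1.1     # load a specific version
ml                                          # show what is currently loaded
ml purge                                    # start from a clean slate again
```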
- `pip`, `apptainer`, `conda`, and `virtualenv` are available so you can install your own Python packages locally (a sketch follows this list).
- Avoid `pip install --user`: it is likely to make a mess when used with any other approach and fill up your home directory quota quickly.
- `pip` can be up to ~9x slower on Vera.
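One way to set this up is a virtual environment on top of a module-provided Python stack (a minimal sketch; the module name is taken from the examples below, and `my_venv`/`some_package` are placeholders):

```bash
ml purge
ml TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1
virtualenv --system-site-packages my_venv   # consider placing it in your storage project area
source my_venv/bin/activate
pip install some_package                    # installs into my_venv/
```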
Tools are available for monitoring your jobs (`job_stats.py`, `nvidia-smi`, …).

- `sbatch`: submit batch jobs
- `srun`: submit interactive jobs
- `jobinfo` (`squeue`): view the job-queue and the state of jobs in queue, shows the amount of idling resources
- `scontrol show job <jobid>`: show details about a job, including reasons why it is pending
- `sprio`: show all your pending jobs and their priority
- `scancel`: cancel a running or pending job
- `sinfo`: show status for the partitions (queues): how many nodes are free, how many are down, busy, etc.
- `sacct`: show scheduling information about past jobs
- `projinfo`: show the projects you belong to, including monthly allocation and usage
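A hypothetical job-management session using these commands (the job id and script name are made up):

```bash
sbatch my_job.sh             # submit a batch script; prints e.g. "Submitted batch job 22441"
jobinfo                      # is it running or still queued?
scontrol show job 22441      # details, including the reason if it is pending
scancel 22441                # cancel it if something is wrong
sacct -j 22441               # accounting information once it has finished
projinfo                     # how much of the monthly allocation is left
```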
- `#SBATCH --gpus-per-node=V100:2`
- `#SBATCH --gpus-per-node=T4:3`
- `#SBATCH --gpus-per-node=A100:1`
- Use `-C` to pick the nodes with more RAM:
  - `#SBATCH --gpus-per-node=V100:2 -C 2xV100` (only 2 V100 on these nodes, thus twice the RAM per GPU)
  - `#SBATCH --gpus-per-node=T4:1 -C MEM1536`
- `#SBATCH -C NOGPU`
- `#SBATCH --gpus-per-node=T4:8 -N 2 --cpus-per-task=32`
- `#SBATCH -N 2 --gres=ptmpdir:1`
- `#SBATCH --gres=gpuexlc:1,mps:1`
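Put together, a minimal batch script requesting one A40 GPU might look like this (a sketch; the project id, module, and script name are placeholders based on the examples in this document):

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 02:00:00
#SBATCH --gpus-per-node=A40:1

ml purge
ml TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1
python my_training_script.py    # hypothetical training script
```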
Type | VRAM | System memory per GPU | CPU cores per GPU | Cost |
---|---|---|---|---|
T4 | 16 GB | 72 or 192 GB | 4 | 0.35 |
A40 | 48 GB | 64 GB | 16 | 1 |
V100 | 32 GB | 192 or 384 GB | 8 | 1.31 |
A100 | 40 GB | 64 or 128 GB | 16 | 1.84 |
A100fat | 80 GB | 256 GB | 16 | 2.2 |
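If the Cost column is read as a relative charge per GPU-hour (an assumption on my part, not stated above), a rough estimate for a job could be computed like this:

```bash
# Hypothetical estimate: 2 A100 GPUs for 10 hours at cost factor 1.84
python -c "print(2 * 10 * 1.84)"   # 36.8 units charged to the project allocation
```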
Data type | A100 | A40 | V100 | T4 |
---|---|---|---|---|
FP64 | 9.7 / 19.5* | 0.58 | 7.8 | 0.25 |
FP32 | 19.5 | 37.4 | 15.7 | 8.1 |
TF32 | 156** | 74.8** | N/A | N/A |
FP16 | 312** | 149.7** | 125 | 65 |
BF16 | 312** | 149.7** | N/A | N/A |
Int8 | 624** | 299.3** | 64 | 130 |
Int4 | 1248** | 598.7** | N/A | 260 |
With `$CUDA_VISIBLE_DEVICES` you can make sure that your application has correctly picked up the hardware:

```
$ srun -A YOUR_ACCOUNT -t 00:02:00 --gpus-per-node=V100:2 --pty bash
srun: job 22441 queued and waiting for resources
srun: job 22441 has been allocated resources
$ echo ${CUDA_VISIBLE_DEVICES}
0,1
```
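From inside the interactive session you can also verify the allocation at framework level (a sketch; the PyTorch module name is taken from the job-script examples below):

```bash
ml PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
python -c "import torch; print(torch.cuda.device_count())"   # expect 2 for --gpus-per-node=V100:2
```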
An example batch script that unpacks a dataset of many small files to the node-local `$TMPDIR` and trains from there:

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=V100:1

unzip many_tiny_files_dataset.zip -d $TMPDIR/
apptainer exec --nv ~/tensorflow-2.1.0.sif python trainer.py --training_input=$TMPDIR/
# or use available containers e.g.
# /apps/containers/TensorFlow/TensorFlow_v2.3.1-tf2-py3-GPU-Jupyter.sif
```
`h5py` is very easy to use.
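A minimal sketch of packing data into a single HDF5 file with `h5py` (the file and dataset names are hypothetical, matching the naming used in the array-job example below):

```bash
ml h5py/3.9.0-foss-2023a    # module name taken from the array-job example below
python - <<'EOF'
# Pack many small samples into one HDF5 file instead of many tiny files
import h5py
import numpy as np

with h5py.File("dataset_0.hdf5", "w") as f:
    f.create_dataset("images", data=np.zeros((100, 28, 28), dtype="float32"))
    f.create_dataset("labels", data=np.zeros(100, dtype="int64"))
EOF
```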
A batch script that runs a Jupyter notebook non-interactively with `ipython`:

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2024-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=A40:1

module purge
module load TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1
module load IPython/8.5.0-GCCcore-11.3.0

ipython -c "%run my-notebook.ipynb"
```
A job array running the same script over ten datasets:

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 5:00:00
#SBATCH --gpus-per-node=T4:2
#SBATCH --array=0-9
#SBATCH --mail-user=zapp.brannigan@chalmers.se --mail-type=end

module load PyTorch/2.1.2-foss-2023a-CUDA-12.1.1 h5py/3.9.0-foss-2023a
python classification_problem.py dataset_$SLURM_ARRAY_TASK_ID.hdf5
```

`$SLURM_ARRAY_TASK_ID` can also be accessed from within all programming languages, e.g. via the environment.
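For instance, from Python (a minimal sketch run as a shell one-liner):

```bash
# Read the array index from the environment inside Python
python -c 'import os; print(os.environ["SLURM_ARRAY_TASK_ID"])'
```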
A multi-node example using Horovod:

```bash
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=T4:8
## 2 tasks across 2 nodes
#SBATCH --nodes 2 --ntasks 2

module load Horovod/0.28.1-foss-2022a-CUDA-11.7.0-TensorFlow-2.11.0
mpirun python horovod_keras_tf2_example.py
```
- Distributed processes are started with `mpirun` (or `srun`).
- Use `jobinfo` or check the portal footer to find idle GPUs.
- Use `nvidia-smi` to check current usage and select your GPU number with `export CUDA_VISIBLE_DEVICES=X`.
- Use `srun`, e.g.

```
srun -A NAISS2023-X-Y -p alvis --gpus-per-node=T4:1 --pty bash
ipython -c "%run name-of-notebook-here.ipynb"
```
- `jobinfo` shows you the queue and available GPUs
- `projinfo` shows your project's allocation and how much of it has been used
- The `sinfo -Rl` command shows the reason if nodes are down (typically for maintenance)
- `scontrol show reservation` shows reservations (e.g. planned maintenance)
- `job_stats.py JOBID` gives you a URL to a public Grafana page with your job's usage.
- You can `ssh` into nodes while your jobs are running and, for example, run `nvidia-smi` or `htop`.
- `torch.profiler` (possibly with TensorBoard; see the sketch after this list)
- `code-server`
- `C3SE_quota`
- `job_stats.py`: did you over-allocate memory until your program was killed?
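The `torch.profiler` suggestion above might be used like this (a minimal sketch with a hypothetical model and input; run inside a job with a GPU and a PyTorch module loaded):

```bash
python - <<'EOF'
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Profile a few forward passes on CPU and GPU
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)

# Show the operations that used the most GPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
EOF
```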