Chalmers e-Commons/C3SE
2023-10-24
Technical specifications:
# of GPUs | GPU type | Compute capability | CPU | Note |
---|---|---|---|---|
44 | V100 | 7.0 | Skylake | |
160 | T4 | 7.5 | Skylake | |
332 | A40 | 8.6 | Icelake | No IB |
296 | A100 | 8.0 | Icelake | Fast Mimer |
32 | A100fat | 8.0 | Icelake | Fast Mimer |
Theoretical peak performance per GPU in TFLOPS (TOPS for the integer types); values marked * or ** are Tensor Core rates:
Data type | A100 | A40 | V100 | T4 |
---|---|---|---|---|
FP64 | 9.7 / 19.5* | 0.58 | 7.8 | 0.25 |
FP32 | 19.5 | 37.4 | 15.7 | 8.1 |
TF32 | 156** | 74.8** | N/A | N/A |
FP16 | 312** | 149.7** | 125 | 65 |
BF16 | 312** | 149.7** | N/A | N/A |
Int8 | 624** | 299.3** | 64 | 130 |
Int4 | 1248** | 598.7** | N/A | 260 |
Login nodes: alvis1.c3se.chalmers.se and alvis2.c3se.chalmers.se
ssh CID@alvis1.c3se.chalmers.se or ssh CID@alvis2.c3se.chalmers.se
alvis1 has 4 shared T4 GPUs and can be used for brief testing, development, debugging, and light pre/post-processing.
alvis2 is the primary data transfer node.
C3SE_quota shows you all your centre storage areas, usage and quotas:
Path: /cephyr/users/my_user
Space used: 17.5GiB Quota: 30GiB
Files used: 20559 Quota: 60000
Path: /mimer/NOBACKUP/groups/my_storage_project
Space used: 2646.5GiB Quota: 5000GiB
Basic commands for working with files: cd, pwd, ls, cp, mv, rsync, rmdir, mkdir, rm
Shared datasets are available under /mimer/NOBACKUP/Datasets/
alvis2 has a download service, ADDS, that can be used to fetch datasets. Use addsctl --help to see usage.
You do not have sudo rights and cannot use apt-get!
Prebuilt containers are available under /apps/containers, or build your own:
apptainer build my_container.sif my_recipe.def
Be careful with your ~/.bashrc; changes there will likely break system utilities like ThinLinc (for you).
Module commands:
module load Foo/1.2.3 or ml Foo/1.2.3 for loading
module list or ml to list all currently loaded modules
module spider Bar or ml spider Bar for searching
module keyword Bar or ml keyword Bar for searching keywords (e.g. extensions in python bundles)
module purge or ml purge for unloading all modules
pip, apptainer, conda, and virtualenv are available so you can install your own Python packages locally.
Avoid pip install --user; this is likely to make a mess when used with any other approach and fill up your home directory quota quickly.
pip is up to ~9x slower on Vera.
Make sure you actually utilize the allocated resources (job_stats.py, nvidia-smi, …).
Useful SLURM commands:
sbatch: submit batch jobs
srun: submit interactive jobs
jobinfo (squeue): view the job-queue and the state of jobs in queue, shows amount of idling resources
scontrol show job <jobid>: show details about a job, including reasons why it’s pending
sprio: show all your pending jobs and their priority
scancel: cancel a running or pending job
sinfo: show status for the partitions (queues): how many nodes are free, how many are down, busy, etc.
sacct: show scheduling information about past jobs
projinfo: show the projects you belong to, including monthly allocation and usage
Examples of requesting GPUs in a job script:
#SBATCH --gpus-per-node=V100:2
#SBATCH --gpus-per-node=T4:3
#SBATCH --gpus-per-node=A100:1
Use -C to pick the nodes with more RAM:
#SBATCH --gpus-per-node=V100:2 -C 2xV100
(only 2 V100 on these nodes, thus twice the RAM per GPU)
#SBATCH --gpus-per-node=T4:1 -C MEM1536
#SBATCH --gpus-per-node=T4:8 -N 2 --cpus-per-task=32
#SBATCH -N 2 --gres=ptmpdir:1
#SBATCH -C NOGPU
#SBATCH --gres=gpuexlc:1,mps:1
Type | VRAM | System memory per GPU | CPU cores per GPU | Cost |
---|---|---|---|---|
T4 | 16GB | 72 or 192 GB | 4 | 0.35 |
A40 | 48GB | 64 GB | 16 | 1 |
V100 | 32GB | 192 or 384 GB | 8 | 1.31 |
A100 | 40GB | 64 or 128 GB | 16 | 1.84 |
A100fat | 80GB | 256 GB | 16 | 2.2 |
By checking $CUDA_VISIBLE_DEVICES you can make sure that your application has correctly picked up the hardware:
srun -A YOUR_ACCOUNT -t 00:02:00 --gpus-per-node=V100:2 --pty bash
srun: job 22441 queued and waiting for resources
srun: job 22441 has been allocated resources
$ echo ${CUDA_VISIBLE_DEVICES}
0,1
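To do the same check from inside your application, here is a minimal sketch assuming a PyTorch module or container is active (TensorFlow has the equivalent tf.config.list_physical_devices('GPU')):
import os
import torch

# SLURM sets CUDA_VISIBLE_DEVICES to the GPUs allocated to this job.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("GPUs visible to PyTorch:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")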
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=V100:1
unzip many_tiny_files_dataset.zip -d $TMPDIR/
apptainer exec --nv ~/tensorflow-2.1.0.sif python trainer.py --training_input=$TMPDIR/
# or use available containers e.g.
# /apps/containers/TensorFlow/TensorFlow_v2.3.1-tf2-py3-GPU-Jupyter.sif
Packing many small files into a single HDF5 file is another good option; h5py is very easy to use.
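A minimal packing/reading sketch (the file and dataset names are made up for illustration):
import glob
import h5py
import numpy as np

# Pack many small .npy files into a single HDF5 file.
with h5py.File("dataset.hdf5", "w") as f:
    for i, path in enumerate(sorted(glob.glob("tiny_files/*.npy"))):
        f.create_dataset(f"sample_{i}", data=np.load(path))

# Read samples back without touching thousands of small files.
with h5py.File("dataset.hdf5", "r") as f:
    first_sample = f["sample_0"][()]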
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=A40:1
module purge
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
module load IPython/8.5.0-GCCcore-11.3.0
ipython -c "%run my-notebook.ipynb"
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 5:00:00
#SBATCH --gpus-per-node=T4:2
#SBATCH --array=0-9
#SBATCH --mail-user=zapp.brannigan@chalmers.se --mail-type=end
module load PyTorch/2.1.2-foss-2023a-CUDA-12.1.1 h5py/3.9.0-foss-2023a
python classification_problem.py dataset_$SLURM_ARRAY_TASK_ID.hdf5
$SLURM_ARRAY_TASK_ID can also be accessed from within all programming languages, e.g.:
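A minimal Python sketch (the dataset naming just mirrors the array job above):
import os

# SLURM exports SLURM_ARRAY_TASK_ID into the environment of each array task.
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
print(f"This task processes dataset_{task_id}.hdf5")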
#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 1-00:00:00
#SBATCH --gpus-per-node=T4:8
## 2 tasks across 2 nodes
#SBATCH --nodes 2 --ntasks 2
module load fosscuda/2019b Horovod/0.19.1-TensorFlow-2.1.0-Python-3.7.4
mpirun python horovod_keras_tf2_example.py
(The processes are launched with mpirun/srun.)
Use jobinfo to pick what happens to be idle.
Use nvidia-smi to check current usage and select your GPU number with export CUDA_VISIBLE_DEVICES=X, or use srun, e.g.:
srun -A NAISS2023-X-Y -p alvis --gpus-per-node=T4:1 --pty bash
ipython -c "%run name-of-notebook-here.ipynb"
jobinfo shows you the queue and available GPUs.
You can ssh into nodes when jobs are running, and for example run nvidia-smi or htop.
Keep track of your project usage (projinfo).
job_stats.py JOBID gives you a URL to a public Grafana page for your job usage.
The sinfo -Rl command shows the reason if nodes are down (typically for maintenance).
Profiling: e.g. torch.profiler (possibly with TensorBoard); a minimal sketch follows below.
Development environments: e.g. code-server.
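A minimal torch.profiler sketch that writes a trace TensorBoard's profiler plugin can display (assumes a PyTorch module is loaded and a GPU is allocated; the model and batch are placeholders):
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

model = torch.nn.Linear(512, 512).cuda()     # placeholder model
batch = torch.randn(64, 512, device="cuda")  # placeholder batch

# Skip 1 step, warm up 1 step, then record 3 steps; the trace lands in
# ./tb_profile and can be opened with TensorBoard's profiler plugin.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("tb_profile"),
) as prof:
    for _ in range(5):
        model(batch).sum().backward()
        prof.step()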
If something goes wrong, check C3SE_quota and job_stats.py: did you over-allocate memory until your program was killed?