2025-02-05
You do not have `sudo` rights! `apt-get` and similar will not work!
| # GPUs | GPUs | FP16 TFLOP/s | FP32 TFLOP/s | FP64 TFLOP/s | Capability |
|---|---|---|---|---|---|
| 16 | A40 | 37.4 | 37.4 | 0.58 | 8.6 |
| 12 | A100 | 77.9 | 19.5 | 9.7 | 8.0 |
| 8 | H100 | 248 | 62 | 30 | 9.0 |
| # nodes | Type | FP32 TFLOP/s | FP64 TFLOP/s | Note |
|---|---|---|---|---|
| 84 | Zen4 | ~12 | ~6 | (full 64 cores) |
| 63 | Icelake | ~8 | ~4 | (full 64 cores) |
Select which CPU generation to use with a constraint, e.g. `-C ICELAKE`.

- Use `buildenv/default-foss-2024a` or similar modules to build your own software.
- `-march=native` is a flag you may want to include if you are not using the `CFLAGS`, `FFLAGS`, etc. from the environment variables.
- `AOCC/4.2.0-GCCcore-13.3.0` for the foss-2024a toolchain.
- `Clang/16.0.6-GCCcore-13.3.0` is also available.
Select the BLAS backend at runtime with the `FLEXIBLAS` environment variable, for example `FLEXIBLAS=OpenBLAS <your executable>` (remember to load the proper modules).
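A minimal sketch of switching backends; `./my_solver` is a placeholder for your own FlexiBLAS-linked program, and the relevant modules are assumed to be loaded already:

```bash
# Run the same binary against two different BLAS backends (no rebuild needed)
FLEXIBLAS=OpenBLAS ./my_solver
FLEXIBLAS=BLIS ./my_solver

# List the backends available in the current environment
# (assumes the flexiblas CLI tool is on your PATH)
flexiblas list
```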
- `alvis1` has 4 T4 GPUs for testing and development (Skylake CPUs)
- `alvis2` is primarily a data transfer node (Icelake CPUs)

Senior researchers at Swedish universities are eligible to apply for NAISS projects at other centres; they may have more time or specialised hardware that suits you, e.g. GPUs, large memory nodes, or support for sensitive data.
We are also part of:
You can find information about NAISS resources on https://supr.naiss.se and https://www.naiss.se
Contact support if you are unsure about what you should apply for.
For a list of PIs with allocations, you can check this list.
Log in with `ssh CID@vera1.c3se.chalmers.se` or `CID@vera2.c3se.chalmers.se` (accessible within the Chalmers and GU networks).

Copy `/apps/portal/` to your home dir `~/portal/`.
```
$ ls -l
...a list of files...
```
- `ls`, list files in working directory
- `pwd`, print current working directory ("where am I")
- `cd directory_name`, change working directory
- `cp src_file dst_file`, copy a file
- `rm file`, delete a file (there is no undelete!)
- `mv nameA nameB`, rename nameA to nameB
- `mkdir dirname`, create a directory

See also `grep`, `find`, `less`, `chgrp`, `chmod`.
Navigation in `man`/`less`:

- `b` - scroll up one screen page
- `q` - quit from the current man page
- `/` - search (type in word, enter)
- `n` - find next search match (`N` for reverse)
- `h` - get further help (how to search the man page etc.)

Home directories:

- Vera: `$HOME = /cephyr/users/<CID>/Vera`
- Alvis: `$HOME = /cephyr/users/<CID>/Alvis`
- Use `C3SE_quota` to check your current quota on all your active storage areas in the shell (`/cephyr/users/<CID>/`).
- Use `where-are-my-files` to find your file quota usage on Cephyr.
- Use `dust` or `dust -f` to check disk space usage and file quota.

`module load module-name [module-name ...]` sets up `PATH`, `LD_LIBRARY_PATH`, `PYTHONPATH`, etc., making the software available:

```
$ mathematica --version
-bash: mathematica: command not found
$ module load Mathematica/13.0.0
$ mathematica --version
13.0
```
Do not load modules in your `~/.bashrc`. You will break things like the desktop. Load modules in each jobscript to make them self-contained, otherwise it’s impossible for us to offer support.

Use `virtualenv`, `apptainer`, or `conda` (least preferable) so you can install your own Python packages locally; see https://www.c3se.chalmers.se/documentation/module_system/python/. Avoid `pip install --user`: such packages will leak into containers and other environments, and will quickly eat up your quota.
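For example, a minimal sketch of a virtual environment on top of a Python module (the module version, environment path, and package are placeholders; check `module avail Python` for what is installed):

```bash
# Create and use a personal virtual environment instead of pip install --user
module purge
module load Python/3.12.3-GCCcore-13.3.0   # placeholder version
python -m venv ~/my_venv                   # lives in your own storage
source ~/my_venv/bin/activate
pip install numpy                          # example package; installs into ~/my_venv
```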
The `buildenv` modules are useful when compiling your own software:

- `buildenv/default-foss-2023a-CUDA-12.1.1` provides a build environment with GCC, OpenMPI, OpenBLAS/BLIS, CUDA
- Build tools: `CMake`, `Autotools`, `git`, …
- Interpreters: `Python`, `Perl`, `Rust`, …
- `LIBRARY_PATH` and other environment variables are set, which can often be picked up automatically by good build systems.
- Typical configure flags: `--with-python=$EBROOTPYTHON`, `--prefix=path_to_local_install`
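A hedged sketch of a typical out-of-tree build without root; the package name, paths, and configure flags are placeholders:

```bash
# Compile and install software into your own directory using a buildenv module
module purge
module load buildenv/default-foss-2023a-CUDA-12.1.1

tar xf mytool-1.0.tar.gz && cd mytool-1.0
./configure --prefix=$HOME/local/mytool    # no sudo needed: install somewhere you own
make -j 4
make install

# Make the tool visible in this shell (add the same line to your jobscripts as needed)
export PATH=$HOME/local/mytool/bin:$PATH
```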
Instructions found online may tell you to use `sudo` to perform steps. They are wrong.

`-march=native` optimizes your code for the current CPU model.

Build containers with `apptainer build my.sif my.def` from a given definition file, e.g.:

```
Bootstrap: docker
From: continuumio/miniconda3:4.12.0

%files
    requirements.txt

%post
    /opt/conda/bin/conda install -y --file requirements.txt
```

or building on top of an existing local image:

```
Bootstrap: localimage
From: path/to/existing/container.sif

%post
    /opt/conda/bin/conda install -y matplotlib

%environment
```
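A minimal usage sketch, assuming the first definition above was saved as `my.def`:

```bash
# Build the image from the definition file, then run a command inside it
apptainer build my.sif my.def
apptainer exec my.sif python --version
```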
`projinfo` lists your projects and current usage. `projinfo -D` breaks down usage day-by-day (up to 30 days back):

```
 Project              Used[h]      Allocated[h]   Queue
    User
-------------------------------------------------------
C3SE2017-1-8         15227.88*     10000          vera
   razanica          10807.00*
   kjellm             2176.64*
   robina             2035.88*   <-- star means we are over 100% usage
   dawu                150.59*       which means this project has lowered priority
   framby               40.76*
-------------------------------------------------------
C3SE507-15-6          9035.27      28000          mob
   knutan             5298.46
   robina             3519.03
   kjellm              210.91
   ohmanm                4.84
```
The `jobinfo -p vera` command shows the current state of nodes in the main partition:

```
Node type usage on main partition:
TYPE               ALLOCATED   IDLE   OFFLINE   TOTAL
ICELAKE,MEM1024            1      3         0       4
ICELAKE,MEM512            41     16         0      57
ZEN4,MEM1536               0      2         0       2
ZEN4,MEM768                1     97         0      98
(...the nodes below are retiring)
SKYLAKE,MEM192            15      2         0      17
SKYLAKE,MEM384             0      6         0       6
SKYLAKE,MEM768             2      0         0       2
SKYLAKE,MEM96,25G         20      0         0      20
SKYLAKE,MEM96            170      5         4     179

Total GPU usage:
TYPE   ALLOCATED   IDLE   OFFLINE   TOTAL
A40            5      7         4      16
A100           8      4         0      12
H100           0      0         8       8
(...the nodes below are retiring)
V100           4      4         0       8
```
Submit jobs with `sbatch <arguments> script.sh`. Job settings can be given inside `script.sh` as well as on the command line:

```bash
#!/bin/bash
#SBATCH -A C3SE2024-11-05
#SBATCH -p vera
#SBATCH -n 4
#SBATCH -t 2-00:00:00

echo "Hello world"
```
Node constraints:

- Select CPU generation with `-C ZEN4` and `-C ICELAKE`; for example, `-C ZEN4` requests the AMD Zen4 nodes.
- `-C MEM1024` requests a 1024GB node - 9 (icelake) total (3 private)
- `-C MEM1536` requests a 1536GB node - 2 (zen4) total (1 private)
- `-C MEM2048` requests a 2048GB node - 3 (icelake) total (all private)
- With `--gpus-per-node`, the node type is bound:
  - `--gpus-per-node=A40:4` requests 4 A40 (icelake)
  - `--gpus-per-node=A100:2` requests 2 A100 (icelake)
  - `--gpus-per-node=H100:1` requests 1 H100 (zen4)
- Avoid constraints (`-C`) unless you know you need them.
- `-C SKYLAKE` still works if you want to finish your work, but it will be removed in a few weeks.

| Type | VRAM | Additional cost (core-equivalents) |
|---|---|---|
| A40 | 48GB | 16 |
| A100 | 40GB | 48 |
| H100 | 96GB | 160 |
For example, using one A100 together with a quarter of the node's 64 cores for 10 hours costs `(64/4 + 48) * 10 = 640` core hours.

An example jobscript:

```bash
#!/bin/bash
#SBATCH -A C3SE2024-11-05
#SBATCH -p vera
#SBATCH -n 32
#SBATCH -t 2-00:00:00

module purge
module load SciPy-bundle/2024.05-gfbf-2024a
source /cephyr/NOBACKUP/groups/naiss2024-xx-xxx/myenv/bin/activate

python myscript.py --input=input.txt
```
A GPU job running inside a container:

```bash
#!/bin/bash
#SBATCH -A C3SE2024-11-05
#SBATCH -t 2-00:00:00
#SBATCH --gpus-per-node=A40:2

apptainer exec --nv tensorflow-2.1.0.sif python cat_recognizer.py
```
More on containers
- `$SLURM_SUBMIT_DIR` is defined in jobs and points to where you submitted your job.
- `$TMPDIR`: local scratch disk on the node(s) of your job (a parallel variant can be requested with `--gres ptmpdir:1`, see below). Automatically deleted when the job has finished.

When should you use `$TMPDIR`?

- You may not need `$TMPDIR` if your program only loads data in one read operation, processes it, and writes the output.
- Use `$TMPDIR` for jobs that perform intensive file I/O!
- Paths starting with `/cephyr/...` or `/mimer/...` mean the network-attached permanent storage is used.
- With `sbatch --gres=ptmpdir:1` you get a distributed, parallel `$TMPDIR` across all nodes in your job. Always recommended for multi-node jobs that use `$TMPDIR`.

Submitted with `sbatch --array=0-99 wind_turbine.sh`:
```bash
#!/bin/bash
#SBATCH -A C3SE2024-11-05
#SBATCH -n 1
#SBATCH -C "ICELAKE|ZEN4"
#SBATCH -t 15:00:00
#SBATCH --mail-user=zapp.brannigan@chalmers.se --mail-type=end

module load MATLAB

cp wind_load_$SLURM_ARRAY_TASK_ID.mat $TMPDIR/wind_load.mat
cp wind_turbine.m $TMPDIR
cd $TMPDIR
RunMatlab.sh -f wind_turbine.m
cp out.mat $SLURM_SUBMIT_DIR/out_$SLURM_ARRAY_TASK_ID.mat
```
`$SLURM_ARRAY_TASK_ID` can also be accessed from within all programming languages, e.g. by reading the environment variable.

The next example is submitted with `sbatch --array=0-50:5 diffusion.sh`:
```bash
#!/bin/bash
#SBATCH -A C3SE2024-11-05
#SBATCH -C ICELAKE
#SBATCH -n 128 -t 2-00:00:00

module load intel/2023a

## Set up new folder, copy the input file there
temperature=$SLURM_ARRAY_TASK_ID
dir=temp_$temperature
mkdir $dir; cd $dir
cp $HOME/base_input.in input.in

## Set the temperature in the input file:
sed -i "s/TEMPERATURE_PLACEHOLDER/$temperature/" input.in

mpirun $HOME/software/my_md_tool -f input.in
```
Here, the array index is used directly as input. If it turns out that 50 degrees was insufficient, then we could do another run:
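For example (a sketch; the exact range is an assumption, continuing in steps of 5 degrees above 50):

```bash
sbatch --array=55-100:5 diffusion.sh
```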
Submitted with `sbatch run_oofem.sh`:

```bash
#!/bin/bash
#SBATCH -A C3SE507-15-6 -p mob
#SBATCH --ntasks-per-node=32 -N 3
#SBATCH -J residual_stress
#SBATCH -t 6-00:00:00
#SBATCH --gres=ptmpdir:1

module load PETSc

cp $SLURM_JOB_NAME.in $TMPDIR
cd $TMPDIR
mkdir $SLURM_SUBMIT_DIR/$SLURM_JOB_NAME

# Periodically copy result files back to permanent storage while the solver runs
while sleep 1h; do
    rsync -a *.vtu $SLURM_SUBMIT_DIR/$SLURM_JOB_NAME
done &
LOOPPID=$!

mpirun $HOME/bin/oofem -p -f "$SLURM_JOB_NAME.in"

kill $LOOPPID
rsync -a *.vtu $SLURM_SUBMIT_DIR/$SLURM_JOB_NAME/
```
You are allowed to use the ThinLinc machines for light/moderate tasks that require interactive input. If you need all cores, or generate load for an extended duration, you must run on the nodes, for example via an interactive `srun` job, as sketched below:
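A minimal sketch (the project, partition, core count, and time limit are placeholders):

```bash
# Request an interactive shell on a compute node
srun -A C3SE2024-11-05 -p vera -n 4 -t 01:00:00 --pty bash
```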
You are eventually presented with a shell on the node:

```
[ohmanm@vera12-3]#
```
`srun` interactive jobs will be aborted if the login node needs to be rebooted or loses internet connectivity. Prefer always using the portal.

- `sbatch`: submit batch jobs
- `srun`: submit interactive jobs
- `jobinfo`, `squeue`: view the job queue and the state of jobs in the queue
- `scontrol show job <jobid>`: show details about a job, including reasons why it is pending
- `sprio`: show all your pending jobs and their priority
- `scancel`: cancel a running or pending job
- `sinfo`: show status for the partitions (queues): how many nodes are free, how many are down, busy, etc.
- `sacct`: show scheduling information about past jobs
- `projinfo`: shows the projects you belong to, including monthly allocation and usage
- `jobinfo -u $USER`: show only your own jobs (see also `projinfo`)
- `job_stats.py JOBID` is essential.
- The `sinfo -Rl` command shows how many nodes are down for repair.
- `torch.profiler` (possibly with TensorBoard) can profile PyTorch code.
- If you put things in your `.bashrc` file, expect things like the desktop session to break. Many support tickets are answered by simply clearing these out.
- Use `job_stats.py`!
- Use `C3SE_quota`!