The Vera cluster is built on Intel Xeon Gold 6130 (code-named "Skylake") CPUs. The system consists of:
- In total 245 compute nodes (total of 7848 cores) with a total of 28 TiB of RAM and 13 GPUs. More specifically:
- 209 compute nodes with 32 cores and 96 GB of RAM
- 18 compute nodes with 32 cores and 192 GB of RAM
- 6 compute nodes with 32 cores and 384 GB of RAM
- 2 compute nodes with 32 cores and 768 GB of RAM
- 2 compute nodes with 32 cores, 384 GB of RAM and 2 NVIDIA Tesla V100 32 GB SXM2 GPUs each
- 1 compute node with 40 cores (Intel Xeon Gold 6230), 384 GB of RAM, 4 NVIDIA Tesla V100 32 GB SXM2 GPUs and 13 TB of fast local NVMe storage
- 5 compute nodes with 32 cores, 92 GB of RAM and 1 NVIDIA Tesla T4 GPU each
- 2 login nodes with 32 cores, 192 GB of RAM and NVIDIA P2000 for remote graphics
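The headline totals can be cross-checked against the per-type list above. The following is a minimal sketch that only re-adds the numbers from the list. The tuple layout and the names `COMPUTE_NODES`, `LOGIN_NODES` and `totals` are illustrative, and treating the login nodes' P2000 cards as not counting towards the GPU total is an assumption.

```python
# (node_count, cores_per_node, gpus_per_node), copied from the list above.
COMPUTE_NODES = [
    (209, 32, 0),  # 96 GB RAM
    (18, 32, 0),   # 192 GB RAM
    (6, 32, 0),    # 384 GB RAM
    (2, 32, 0),    # 768 GB RAM
    (2, 32, 2),    # 2 x V100 each
    (1, 40, 4),    # Xeon Gold 6230 node, 4 x V100
    (5, 32, 1),    # 1 x T4 each
]
# Assumption: the P2000 graphics cards are not counted as compute GPUs.
LOGIN_NODES = [(2, 32, 0)]


def totals(groups):
    """Return (nodes, cores, gpus) summed over the given node groups."""
    nodes = sum(n for n, _, _ in groups)
    cores = sum(n * c for n, c, _ in groups)
    gpus = sum(n * g for n, _, g in groups)
    return nodes, cores, gpus


print(totals(COMPUTE_NODES))                # compute nodes only
print(totals(COMPUTE_NODES + LOGIN_NODES))  # compute + login nodes
```

Under these assumptions, only the second call reproduces 245 nodes, 7848 cores and 13 GPUs, which suggests that the headline figures include the two login nodes.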
There are also 3 system servers used for accessing and managing the cluster.
There is a 25 Gigabit Ethernet network used for logins, a dedicated management network, and an InfiniBand high-speed/low-latency network for parallel computations and filesystem access. The nodes are equipped with Mellanox ConnectX-3 FDR InfiniBand 56 Gbps HCAs.
The servers are built by Supermicro and the compute node hardware by Intel. The system was delivered by Southpole.
Cores, threads and CPUs
Unlike the previous systems at C3SE, Vera nodes have Hyper-Threading (HT for short) enabled.
Each Vera node has 2 physical processors with 16 physical cores each, giving a total of 32 cores per node. With HT enabled there are 2 threads per core and a total of 64 threads per node, so the following must be taken into consideration:
- If your code is heavily optimised for the Vera hardware, you will probably not benefit from HT and should use only 1 task per core. To do this, add "-c 2" (or "--cpus-per-task=2") to your jobscript or to the command line.
- You will probably want to benchmark using "-n X", "-n X -c 2" and "-n 2X", where X is the number of MPI-processes that will be launched.
- mpirun automatically picks up the relevant information from Slurm, so you probably only want "mpirun ./my.exe" in your jobscript (i.e. no "-n" or "-np" flags).
- Slurm only allocates full cores, i.e. you will always get an even number of tasks/threads if you do not use "-c 2".
- Specifying only "-n1" will actually give you 2 tasks/threads to use (one physical core).
- In $TMPDIR you will find task files in MPICH and LAM format:
  - with all tasks: $TMPDIR/mpichnodes, $TMPDIR/lamnodes
  - with physical cores only: $TMPDIR/mpichnodes.no_HT, $TMPDIR/lamnodes.no_HT
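The interaction between "-n", "-c 2" and full-core allocation described above can be illustrated with a small calculation. This is a simplified model, not a description of Slurm's internals: it assumes each task occupies `cpus_per_task` hardware threads and that the request is rounded up to whole physical cores. The function name `allocation` and the value of `x` are hypothetical.

```python
import math

THREADS_PER_CORE = 2  # Hyper-Threading is enabled on Vera nodes


def allocation(ntasks, cpus_per_task=1):
    """Return (physical_cores, hardware_threads) for a request.

    Simplified model (assumption): each task uses cpus_per_task hardware
    threads, and the total is rounded up to whole physical cores.
    """
    threads_requested = ntasks * cpus_per_task
    cores = math.ceil(threads_requested / THREADS_PER_CORE)
    return cores, cores * THREADS_PER_CORE


x = 16  # example number of MPI processes (hypothetical)
for label, n, c in [("-n X", x, 1), ("-n X -c 2", x, 2), ("-n 2X", 2 * x, 1)]:
    cores, threads = allocation(n, c)
    print(f"{label:10s} tasks={n:3d} cores={cores:3d} threads={threads:3d}")
```

In this model the three benchmark variants differ in how tasks map to cores: "-n X" packs two tasks onto each core, "-n X -c 2" gives each task a core of its own, and "-n 2X" uses the same number of cores as "-n X -c 2" but with two tasks per core.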
For general information on running jobs, see running jobs
GPU cost on Vera
Jobs are "charged" based on the number of physical cores they allocate, plus an additional cost for each GPU they allocate.
- Example: A job using a full node with a single T4 for 10 hours:
(32 + 6) * 10 = 380 core hours
- Note: 16, 32, and 64 bit floating point performance differ greatly between these specialized GPUs. Pick the one most efficient for your application.
- The additional running cost of a GPU is based on its price compared to a CPU node.
- You don't pay anything extra for selecting a node with more memory, but you are typically competing for less available hardware.
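The T4 example above can be written as a small helper. This is a sketch, not an official accounting tool: the function name `job_cost` is hypothetical, the value 6 is taken from the T4 example, and the idea that extras for several GPUs simply add up is an assumption.

```python
# Taken from the T4 example above; extras for other GPU types are not listed here.
T4_EXTRA_COST = 6


def job_cost(physical_cores, hours, gpu_extra_costs=()):
    """Return the cost in core hours: (cores + sum of GPU extras) * hours.

    Assumption: the extra costs of several GPUs are simply added together.
    """
    return (physical_cores + sum(gpu_extra_costs)) * hours


print(job_cost(32, 10, [T4_EXTRA_COST]))  # full node with one T4 for 10 hours
print(job_cost(32, 10))                   # full node without a GPU extra
```

The first call reproduces the 380 core hours from the example. The second shows that the same 10 hours on a full node without a GPU extra come to 320 core hours.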
If you need support (trouble logging in, how to run your software, etc.), please first:
- Contact the PI of your project and see if they can help
- Talk with your fellow students/colleagues
- Contact C3SE support