Vera

Vera hardware

The Vera cluster is built on Intel Xeon Gold 6130 (code-named "Skylake") CPUs. The system consists of:

  • In total 245 nodes (7848 cores), with a total of 28 TiB of RAM and 13 GPUs. More specifically:
    • 209 compute nodes with 32 cores and 96 GB of RAM
    • 18 compute nodes with 32 cores and 192 GB of RAM
    • 6 compute nodes with 32 cores and 384 GB of RAM
    • 2 compute nodes with 32 cores and 768 GB of RAM
    • 2 compute nodes with 32 cores, 384 GB of RAM and 2 NVIDIA Tesla V100 32 GB SXM2 GPUs each
    • 1 compute node with 40 cores (Intel Xeon Gold 6230), 384 GB of RAM, 4 NVIDIA Tesla V100 32 GB SXM2 GPUs and 13 TB of fast local NVMe storage
    • 5 compute nodes with 32 cores, 92 GB of RAM and 1 NVIDIA Tesla T4 GPU each
    • 2 login nodes with 32 cores, 192 GB of RAM and NVIDIA P2000 for remote graphics

There are also 3 system servers used for accessing and managing the cluster.

There's a 25 Gigabit Ethernet network used for logins, a dedicated management network and an InfiniBand high-speed/low-latency network for parallel computations and filesystem access. The nodes are equipped with Mellanox ConnectX-3 FDR InfiniBand 56 Gbps HCAs.

The servers are built by Supermicro and the compute node hardware by Intel; the system is delivered by Southpole.

Cores, threads and CPUs

One difference from previous systems at C3SE is that Hyper-Threading (HT for short) is enabled on the Vera nodes.

Each Vera node has 2 physical processors with 16 (physical) cores each, giving a total of 32 cores per node. With HT enabled (2 threads per core, i.e. 64 threads per node) the following must be taken into consideration:

  • If your code is heavily optimised for the Vera hardware, you will probably not benefit from HT and should run only 1 task per physical core. To do this, add "-c 2" (or "--cpus-per-task=2") to your jobscript or the command line (see the example jobscript after this list).
  • You will probably want to benchmark using "-n X", "-n X -c 2" and "-n 2X", where X is the number of MPI processes to launch.
  • mpirun automatically picks up the relevant information from Slurm, so you probably only need "mpirun ./my.exe" in your jobscript (i.e. no "-n" or "-np" flags).
  • Slurm only allocates full cores, i.e. you will get an even number of tasks/threads unless you use "-c 2".
  • Specifying only "-n 1" will actually give you 2 tasks/threads to use (one physical core).
  • In $TMPDIR you will find task-files in MPICH and LAM format:
    • with all tasks: $TMPDIR/mpichnodes, $TMPDIR/lamnodes
    • with physical cores only: $TMPDIR/mpichnodes.no_HT, $TMPDIR/lamnodes.no_HT
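
As a minimal sketch (the project ID is hypothetical and module loading is omitted), a jobscript that runs one MPI task per physical core on a full Vera node might look like this:

    #!/usr/bin/env bash
    #SBATCH -A C3SE-PROJECT-ID     # hypothetical project ID, replace with your own
    #SBATCH -n 32                  # 32 MPI tasks ...
    #SBATCH -c 2                   # ... each on a full physical core (both HT threads)
    #SBATCH -t 01:00:00

    # mpirun picks up the task layout from Slurm, so no -n/-np flags are needed
    mpirun ./my.exe

    # Alternative for MPICH-style launchers: use the generated machinefile that
    # lists physical cores only
    # mpiexec -machinefile $TMPDIR/mpichnodes.no_HT ./my.exe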

For general information on running jobs, see running jobs.

GPU cost on Vera

Jobs "cost" based on the number of physical cores they allocate, plus an additional cost for each allocated GPU, as shown in the table below:

Type   VRAM   Additional cost (core equivalents)
T4     16 GB   6
A40*   48 GB  16
V100   32 GB  20
A100*  40 GB  48
  • Example: A job using a full node with a single T4 for 10 hours costs (32 + 6) * 10 = 380 core-hours (see the jobscript sketch after this list).
  • Note: 16, 32, and 64 bit floating point performance differ greatly between these specialized GPUs. Pick the one most efficient for your application.
  • The additional cost is based on the price of the GPU relative to a CPU node.
  • You don't pay any extra for selecting a node with more memory, but you are typically competing for fewer available nodes.
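
As a hedged sketch (the GPU request syntax, type string and project ID are assumptions; check the current C3SE documentation for the exact form), a full-node job with a single T4 could be requested like this, and for a 10-hour run it would be charged (32 + 6) * 10 = 380 core-hours:

    #!/usr/bin/env bash
    #SBATCH -A C3SE-PROJECT-ID     # hypothetical project ID, replace with your own
    #SBATCH -n 32                  # a full 32-core node
    #SBATCH --gres=gpu:T4:1        # one T4 GPU (GRES type name assumed; verify locally)
    #SBATCH -t 10:00:00            # 10 hours -> (32 + 6) * 10 = 380 core-hours

    ./my_gpu_program               # hypothetical GPU executable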

Support

If you need support (trouble logging in, questions about how to run your software, etc.), please first:

  • Contact the PI of your project and see if they can help
  • Talk with your fellow students/colleagues
  • Contact C3SE support